**Data Mining Project**

In this project you will implement a solution to a text mining problem. The problem is to classify a document (that is, a given amount of text) as belonging to either computer science (CS) or non-CS. (This problem has applications in many areas, such as filtering email spam from non-spam.) For simplicity, our “document” for this project will consist of just a single sentence or part of a sentence (e.g., a phrase or a clause).


1) Decide on a number of keywords in advance, and then represent a document as a vector of those keyword counts, where each component of the vector is integer-valued, representing the count (frequency) of the corresponding keyword in the document. Use a supervised learning scenario. You will create (choose) your own training and test data (creating the data is an important exercise).

2) Create data by using randomly chosen sentences from, for example, your textbooks, works of fiction, or news. Use the Naïve Bayes approach.

Case I: The keyword counts in a document vector are each binary (1 or 0, representing the presence or 
absence of that keyword in the document). 

Case II: The keyword counts are non-negative integers (zero or more).

This project involves slightly extending one of the techniques discussed in class. Feel free to look up material on the Internet, but the final code should be your own work.
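
To make the two representations concrete, here is a minimal sketch that turns one sentence into a binary vector (Case I) and a count vector (Case II). The three-word vocabulary, the example sentence and the helper name `keyword_vectors` are illustrative assumptions, not part of the assignment.

```python
import re
from collections import Counter

# Hypothetical mini-vocabulary, for illustration only.
keywords = ["algorithm", "memory", "tiger"]


def keyword_vectors(document, keywords):
    """Return (binary_vector, count_vector) for one document."""
    # Lowercase the text and split it into alphanumeric tokens.
    tokens = re.findall(r"[a-z0-9]+", document.lower())
    counts = Counter(tokens)
    # Case II: how often each keyword occurs.
    count_vector = [counts[k] for k in keywords]
    # Case I: whether each keyword occurs at all.
    binary_vector = [1 if c > 0 else 0 for c in count_vector]
    return binary_vector, count_vector


binary_vec, count_vec = keyword_vectors(
    "The algorithm trades memory for speed, so memory use grows.", keywords
)
print(binary_vec)  # [1, 1, 0]
print(count_vec)   # [1, 2, 0]
```

The two vectors differ only where a keyword repeats, which is why the two cases can give very similar classifiers on short sentences.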

Submit the source code, some (or all, if the training set is not huge) training data, five different test cases (sentences) and the corresponding classifications (CS or non-CS). Also, show the result (classifications) your algorithm produces when the training data (without the CS/non-CS label) are used as input (test data).

All instructions on group work, acceptable formats for file upload, use of packages, etc. are as in the previous projects.

### **Project Solution Blueprint**

The solution involves the following steps:
  
* Data Preparation
    * Decide on keywords (Scope Vocabulary)
    * Generate random sentences/phrases using a subset of the keywords and annotate them. Extract a dataset summary
    * Case 1: Text representation in the form of keyword presence (binary 0 or 1)
    * Case 2: Text representation in the form of keyword counts
    * Generate training and test datasets
* Perform supervised classification on the dataset
* Display the results for 5 training sentences and 5 test sentences with the corresponding classifications
* Display the result of the algorithm when it is trained on CS data only and the test data contains only non-CS data

**Step-1: Data Preparation**

#### Step 1.1. Decide on keywords (Scope Vocabulary)

We use 100 keywords, so each document vector has 100 components: 50 CS keywords and 50 non-CS keywords.

The CS keywords are a selection of common computer science terms taken from this glossary: http://marvin.cs.uidaho.edu/Teaching/CS112/terms.pdf

In [6]:
computer_science_key_words = ['Abstraction', 'Cache', 'algorithm', 'Assignment', 'Syntax', 'Program', 'Nano-', 'Error', 'Thread', 'Variable', 'File', 
                              'Supercomputer', 'UNIX', 'I/O', 'database', 'Unicode', 'Source', 'sharing', 'Analog', 'Southbridge', 
                              'Kilo-', 'Boot', 'drive', 'Operator', 'Disk', 'Motherboard', 'Peta-', 'Exception', 'Clock', 
                              'Emacs', 'RAM', 'Comment', 'Memory', 'Base', 'IT', 'Nest', 'Boolean', 'Machine', 'HW', 'Software', 
                              'Time', 'programming', 'Call', 'Two', 'intensive', 'Kilobyte', 'Drone', 'Telepresence', 'Chip', 'Computing']
len(computer_science_key_words)

50

In [7]:
# NOTE: no comma follows "woman" or "football", so Python joins the adjacent string
# literals into the keywords "womanmadcap" and "footballmonumental" (50 entries in total).
non_cs_keywords = ["academe", "accused", "amazement", "Sachin", "Johnny", "Movie", "batman", "spiderman", "superman", "tendulkar", "william", "shaksphere", "playing", "smoking", "chicken", "elephant",
"impartial", "invulnerable", "lackluster", "tiger", "lion", "parliment", "president", "minister", "dancing", "eating", "drinking",
"laughable", "lonely", "lustrous", "batsman", "bowler", "umpire", "female", "male", "man", "woman"
"madcap", "majestic", "mimic", "utah", "Colorado", "California", "Italy", "heaven", "france", "paris", "england", "football"
"monumental", "moonbeam", "noiseless"]
len(non_cs_keywords)

50
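
The non-CS list has no comma after "woman" or after "football". Python silently joins adjacent string literals, which would explain why the generated sentences later contain "womanmadcap" and "footballmonumental". A minimal sketch of the effect, using a shortened list for illustration:

```python
# Adjacent string literals with no comma between them are joined by Python.
words_missing_comma = ["man", "woman"
                       "madcap", "majestic"]
words_with_comma = ["man", "woman",
                    "madcap", "majestic"]

print(len(words_missing_comma))  # 3
print(words_missing_comma[1])    # womanmadcap
print(len(words_with_comma))     # 4
```

Checking `len()` against the number of keywords you intended to type is a cheap way to catch this.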

In [8]:
all_key_words_vocabulary = computer_science_key_words + non_cs_keywords
len(set(all_key_words_vocabulary)), len(all_key_words_vocabulary)

(100, 100)

#### Step 1.2 Generate Random Sentences using these keywords and annotate them. Extract Summary

- To generate random sentences, we use a pretrained neural language model through the existing 'keytotext' library, which builds on T5 transformers.

- We generate 200 sentences from the CS keywords and 200 from the non-CS keywords, and label them 'CS' and 'Non-CS' respectively.

In [None]:
# install keytotext  Reference: https://github.com/gagan3012/keytotext
#!pip install keytotext

In [None]:
from keytotext import pipeline

Global seed set to 42


In [None]:
nlp = pipeline("k2t-base")


In [None]:
params = {"do_sample":False, "num_beams":2, "no_repeat_ngram_size":1, "early_stopping":True}

In [None]:
import pandas as pd
import random
cs_df = pd.DataFrame(columns=['text', 'label'])
non_cs_df = pd.DataFrame(columns=['text', 'label'])

In [None]:
# generate 200 CS samples
for i in range(200):
  # pick 10 random computer science keywords and build a sentence from them
  target_words = random.sample(computer_science_key_words, 10)
  text_i = nlp(target_words, **params).replace("|", " ").replace('[/<unk>"', ' ')
  cs_df.loc[i, 'text'] = text_i
  cs_df.loc[i, 'label'] = 'CS'
  if i!=0 and i%100 == 0:
    print(f"Generated {i} sentences")

Generated 100 sentences
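
The loop above grows the DataFrame one row at a time and draws from the global `random` state. A possible alternative, sketched under assumptions, collects the rows in a list and uses a seeded `random.Random` so the sampled keywords are repeatable. The helper `build_labelled_frame` and the stand-in generator `join_words` are hypothetical names; in the notebook the generator would be the keytotext call shown above.

```python
import random

import pandas as pd


def build_labelled_frame(keywords, label, n_sentences, words_per_sentence,
                         generate, seed=42):
    """Sample keywords reproducibly and build a (text, label) DataFrame."""
    rng = random.Random(seed)  # private generator, independent of global state
    rows = []
    for _ in range(n_sentences):
        target_words = rng.sample(keywords, words_per_sentence)
        rows.append({"text": generate(target_words), "label": label})
    # Build the frame once instead of appending row by row.
    return pd.DataFrame(rows, columns=["text", "label"])


def join_words(words):
    """Stand-in generator so the sketch runs without downloading a model."""
    return " ".join(words)


demo_df = build_labelled_frame(
    ["Cache", "RAM", "Disk", "Chip"], "CS",
    n_sentences=3, words_per_sentence=2, generate=join_words,
)
print(demo_df.shape)  # (3, 2)
```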


In [None]:
cs_df.shape

(200, 2)

In [None]:
cs_df.head()

Unnamed: 0,text,label
0,Analog I/O is assigned to the machine that use...,CS
1,The Ram's base is in the Township of Syntax HW.,CS
2,Syntax is a two-way video game with an algorit...,CS
3,I/O is the source of a program called Machine ...,CS
4,"The callsign Kilobyte, with the Unicode of 'Sh...",CS


In [None]:
cs_df.to_csv("cs_df.csv", index=False)

In [None]:
# generate 200 non-CS samples
for i in range(200):
  # pick 10 random non-CS keywords and build a sentence from them
  target_words = random.sample(non_cs_keywords, 10)
  text_i = nlp(target_words, **params).replace("|", " ").replace('[/<unk>"', ' ')
  non_cs_df.loc[i, 'text'] = text_i
  non_cs_df.loc[i, 'label'] = 'Non-CS'
  if i!=0 and i%100 == 0:
    print(f"Generated {i} sentences")

Generated 100 sentences


In [None]:
non_cs_df.shape

(200, 2)

In [None]:
non_cs_df.head()

Unnamed: 0,text,label
0,"The moonbeam is superman, while the lustrous a...",Non-CS
1,"Invulnerable, lonely and shyly tendulkar is utah",Non-CS
2,"The bowler of the word ""lustrous"" is umpire an...",Non-CS
3,"The bowler tiger, which has a tint of white an...",Non-CS
4,Sachin's footballmonumental has a lackluster i...,Non-CS


In [None]:
non_cs_df.to_csv("non_cs_df.csv", index=False)

In [None]:
# Concat (Merge) the CS records and Non-CS records
dataset_df = pd.concat([cs_df, non_cs_df], ignore_index=True)
dataset_df = dataset_df.sample(frac=1, random_state=42)
dataset_df.reset_index(drop=True, inplace=True)
print(dataset_df.shape)
dataset_df.head()

(400, 2)


Unnamed: 0,text,label
0,Umpire was amazed by the moonbeam.,Non-CS
1,"The movie Utah, with the lackluster of ""angry""...",Non-CS
2,The Clock Unicoded for the program is part of ...,CS
3,A Tiger is a dish that can be found in the UTA...,Non-CS
4,Emacs is a source of Syntax and the abbreviati...,CS


In [2]:
# validate distribution = Balanced dataset
dataset_df.label.value_counts()

Non-CS    200
CS        200
Name: label, dtype: int64

In [None]:
dataset_df.to_csv("full_dataset.csv", index=False)

In [22]:
import pandas as pd
import random
dataset_df = pd.read_csv("full_dataset.csv")
cs_df = pd.read_csv("cs_df.csv")
non_cs_df = pd.read_csv("non_cs_df.csv")

In [None]:
# Clean text

In [82]:
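import re

# Replace every character that is not a letter, digit, space, newline or period with a space
for idx, row in cs_df.iterrows():
  cs_df.loc[idx, 'text'] = re.sub(r'[^a-zA-Z0-9 \n.]', ' ', row['text'])
# The step above turns "I/O" into "I O"; restore the keyword
for idx, row in cs_df.iterrows():
  if "I O" in row['text']:
      cs_df.loc[idx, 'text'] = row['text'].replace("I O", "I/O")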
cs_df.head()

Unnamed: 0,text,label
0,Analog I/O is assigned to the machine that use...,CS
1,The Ram s base is in the Township of Syntax HW.,CS
2,Syntax is a two way video game with an algorit...,CS
3,I/O is the source of a program called Machine ...,CS
4,The callsign Kilobyte with the Unicode of Sh...,CS


In [83]:
# Same cleaning as for the CS sentences
for idx, row in non_cs_df.iterrows():
  non_cs_df.loc[idx, 'text'] = re.sub(r'[^a-zA-Z0-9 \n.]', ' ', row['text'])
for idx, row in non_cs_df.iterrows():
  if "I O" in row['text']:
      non_cs_df.loc[idx, 'text'] = row['text'].replace("I O", "I/O")
non_cs_df.head()

Unnamed: 0,text,label
0,The moonbeam is superman while the lustrous a...,Non-CS
1,Invulnerable lonely and shyly tendulkar is utah,Non-CS
2,The bowler of the word lustrous is umpire an...,Non-CS
3,The bowler tiger which has a tint of white an...,Non-CS
4,Sachin s footballmonumental has a lackluster i...,Non-CS
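
The same cleaning can be written without explicit row loops by using the pandas string accessor. This is a sketch on two made-up sentences, not a replacement the notebook depends on:

```python
import pandas as pd

demo_df = pd.DataFrame(
    {"text": ["The Ram's base, I/O!", "Syntax is a two-way game."]}
)

cleaned = (
    demo_df["text"]
    # Anything that is not a letter, digit, space, newline or period becomes a space.
    .str.replace(r"[^a-zA-Z0-9 \n.]", " ", regex=True)
    # The slash in "I/O" was just removed; put the keyword back.
    .str.replace("I O", "I/O", regex=False)
)
print(cleaned.tolist())
```

Note that the cleaning also removes hyphens, so keywords written with a trailing hyphen (such as 'Nano-') can no longer appear verbatim in the cleaned text.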


### **Prepare Keywords Vocabulary to be used in Vectorization**

In [84]:
# shuffle the vocabulary
random.Random(42).shuffle(all_key_words_vocabulary)

In [85]:
keywords_with_indices = {word: idx for idx, word in enumerate(all_key_words_vocabulary)}

In [86]:
# print the first 6 entries to check
for idx, (key, val) in enumerate(keywords_with_indices.items()):
  print(f"{key}-{val}")
  if idx == 5:
    break

Software-0
Operator-1
I/O-2
lonely-3
Southbridge-4
noiseless-5


### **Split Train and Test set**

In [87]:
# 80-20 train/test split, done separately per class so that both sets stay balanced
index_split = int(0.2 * 200)
index_split

40

In [88]:
train_cs_df = cs_df[index_split:]
test_cs_df = cs_df[:index_split]
print(train_cs_df.shape, test_cs_df.shape)

(160, 2) (40, 2)


In [89]:
train_non_cs_df = non_cs_df[index_split:]
test_non_cs_df = non_cs_df[:index_split]
print(train_non_cs_df.shape, test_non_cs_df.shape)

(160, 2) (40, 2)


In [90]:
train_df = pd.concat([train_cs_df, train_non_cs_df], ignore_index=True).sample(frac=1, random_state=42)
train_df.reset_index(drop=True, inplace=True)
train_df.shape

(320, 2)

In [91]:
train_df.label.value_counts()

Non-CS    160
CS        160
Name: label, dtype: int64

In [92]:
train_df.head()

Unnamed: 0,text,label
0,The film Swarah accused is from Italy s lus...,Non-CS
1,The bowler of Colorado is male and has a varia...,Non-CS
2,The Unicode for the United States unk Super...,CS
3,The two years it was operated by the Network ...,CS
4,Supercomputer Drone is part of the Emacs syste...,CS


In [93]:
test_df = pd.concat([test_cs_df, test_non_cs_df], ignore_index=True).sample(frac=1, random_state=42)
test_df.reset_index(drop=True, inplace=True)
test_df.shape

(80, 2)

In [94]:
test_df.label.value_counts()

CS        40
Non-CS    40
Name: label, dtype: int64
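
The manual split above takes the first 40 rows of each class as the test set. An alternative, shown here only as a sketch on a made-up frame, is scikit-learn's `train_test_split` with `stratify`, which shuffles and keeps the class balance in a single call:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy balanced dataset: 10 CS and 10 Non-CS rows (illustrative only).
demo_df = pd.DataFrame({
    "text": [f"sentence {i}" for i in range(20)],
    "label": ["CS"] * 10 + ["Non-CS"] * 10,
})

demo_train, demo_test = train_test_split(
    demo_df,
    test_size=0.2,              # 80-20 split
    stratify=demo_df["label"],  # keep the CS / Non-CS ratio in both parts
    random_state=42,
)
print(demo_train["label"].value_counts().to_dict())
print(demo_test["label"].value_counts().to_dict())
```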

In [95]:
test_df.tail(5)

Unnamed: 0,text,label
75,Disk Boolean s UNIX has an absolute support f...,CS
76,The president of the country is called Man s ...,Non-CS
77,The bowling match for the American actor who i...,Non-CS
78,The operating organization for a computer with...,CS
79,Tiger is the accused for shooting in a Dramati...,Non-CS


#### **Case 1: Prepare Text Representation in the form of keyword presence (binary 0 or 1)**

In [113]:
from sklearn.feature_extraction.text import CountVectorizer

In [114]:
case_1_count_vectorizer = CountVectorizer(vocabulary=keywords_with_indices, lowercase=False)

In [115]:
case1_count_matrix_on_train_set = case_1_count_vectorizer.fit_transform(train_df['text'].tolist())

In [116]:
case_1_train_count_array = case1_count_matrix_on_train_set.toarray()
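# get_feature_names() was removed in recent scikit-learn releases; get_feature_names_out() needs scikit-learn >= 1.0
case_1_X_train_df = pd.DataFrame(data=case_1_train_count_array, columns=case_1_count_vectorizer.get_feature_names_out())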



In [117]:
train_df[0:5]['text'].tolist()

['The film  Swarah  accused  is from Italy s lustulkar',
 'The bowler of Colorado is male and has a variation known as superman.',
 'The Unicode for the  United States unk   Supercomputer was used with an',
 'The two years it was operated by the Network  Nano   Cache were',
 'Supercomputer Drone is part of the Emacs system  which has an Exception']

In [118]:
case_1_X_train_df.head()

Unnamed: 0,Software,Operator,I/O,lonely,Southbridge,noiseless,programming,Abstraction,Kilobyte,male,...,Johnny,batman,Italy,majestic,Telepresence,intensive,Comment,Variable,president,Emacs
0,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [119]:
case_1_X_train_df.max().describe()

count    100.000000
mean       0.880000
std        0.356186
min        0.000000
25%        1.000000
50%        1.000000
75%        1.000000
max        2.000000
dtype: float64

### Note that CountVectorizer records keyword counts, but Case 1 needs only binary keyword presence. We use the pandas clip function to cap every count greater than 1 at 1.

In [120]:
case_1_X_train_df.clip(upper=1, inplace=True)

In [123]:
case_1_X_train_df.max().describe()

count    100.000000
mean       0.870000
std        0.337998
min        0.000000
25%        1.000000
50%        1.000000
75%        1.000000
max        1.000000
dtype: float64

##### As seen above, the values greater than 1 are now clipped to 1
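
`CountVectorizer` also has a `binary` parameter that sets every non-zero count to 1, which gives the same result as clipping afterwards. A minimal sketch with a made-up three-word vocabulary:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative vocabulary and sentences, not the notebook's data.
demo_vocabulary = {"Cache": 0, "Emacs": 1, "tiger": 2}
demo_sentences = ["Cache after Cache feeds Emacs", "A tiger sleeps"]

count_vectorizer = CountVectorizer(vocabulary=demo_vocabulary, lowercase=False)
binary_vectorizer = CountVectorizer(
    vocabulary=demo_vocabulary, lowercase=False, binary=True
)

counts = count_vectorizer.transform(demo_sentences).toarray()
presence = binary_vectorizer.transform(demo_sentences).toarray()

print(counts)    # raw counts: "Cache" appears twice in the first sentence
print(presence)  # presence only: every non-zero count becomes 1
```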

In [128]:
# Reuse the CountVectorizer fitted on the train set to vectorize the test set
# (the counts are clipped to 1 in a later cell, since this is Case 1)

case1_count_matrix_on_test_set = case_1_count_vectorizer.transform(test_df['text'].tolist())
case_1_test_count_array = case1_count_matrix_on_test_set.toarray()
case_1_X_test_df = pd.DataFrame(data=case_1_test_count_array, columns=case_1_count_vectorizer.get_feature_names_out())  # scikit-learn >= 1.0



In [129]:
case_1_X_test_df.head()

Unnamed: 0,Software,Operator,I/O,lonely,Southbridge,noiseless,programming,Abstraction,Kilobyte,male,...,Johnny,batman,Italy,majestic,Telepresence,intensive,Comment,Variable,president,Emacs
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
2,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [130]:
test_df['text'].tolist()[:4]

['The callsign Error Emacs  encode  has the  Un',
 'Analog I/O is assigned to the machine that uses a Kilo intensive',
 'Analog Number is the song for memory in Southbridge  which has a',
 'Chip Clock is a video game for RAM.']
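
The second test sentence contains "I/O", yet its I/O column in the matrix above is 0. The likely cause is `CountVectorizer`'s default `token_pattern`, which only keeps runs of two or more word characters, so "I/O" is never produced as a token. The sketch below compares the default with a whitespace-based pattern; the pattern `r"\S+"` is one possible choice and an assumption here, and it has its own drawback (a keyword followed by a period, such as "RAM.", would then not match).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Small illustrative vocabulary containing the problematic keyword.
demo_vocabulary = {"Analog": 0, "I/O": 1, "intensive": 2}
sentence = ["Analog I/O is assigned to the machine that uses a Kilo intensive"]

default_vectorizer = CountVectorizer(vocabulary=demo_vocabulary, lowercase=False)
whitespace_vectorizer = CountVectorizer(
    vocabulary=demo_vocabulary, lowercase=False, token_pattern=r"\S+"
)

default_counts = default_vectorizer.transform(sentence).toarray()
whitespace_counts = whitespace_vectorizer.transform(sentence).toarray()

print(default_counts)     # [[1 0 1]]  "I/O" is missed
print(whitespace_counts)  # [[1 1 1]]  "I/O" is counted
```

Because `lowercase=False` makes matching case-sensitive, a keyword stored as 'Memory' also does not match "memory" in a sentence.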

In [131]:
case_1_X_test_df.max().describe()


count    100.000000
mean       0.670000
std        0.472582
min        0.000000
25%        0.000000
50%        1.000000
75%        1.000000
max        1.000000
dtype: float64

In [132]:
case_1_X_test_df.clip(upper=1, inplace=True)

In [133]:
case_1_X_test_df.max().describe()


count    100.000000
mean       0.670000
std        0.472582
min        0.000000
25%        0.000000
50%        1.000000
75%        1.000000
max        1.000000
dtype: float64

### **Case 2: Prepare text representation in the form of keyword counts**

In [134]:
from sklearn.feature_extraction.text import CountVectorizer

In [135]:
case_2_count_vectorizer = CountVectorizer(vocabulary=keywords_with_indices, lowercase=False)
case2_count_matrix_on_train_set = case_2_count_vectorizer.fit_transform(train_df['text'].tolist())

In [136]:
case_2_train_count_array = case2_count_matrix_on_train_set.toarray()
# get_feature_names_out() needs scikit-learn >= 1.0
case_2_X_train_df = pd.DataFrame(data=case_2_train_count_array, columns=case_2_count_vectorizer.get_feature_names_out())



In [137]:
train_df[0:5]['text'].tolist()

['The film  Swarah  accused  is from Italy s lustulkar',
 'The bowler of Colorado is male and has a variation known as superman.',
 'The Unicode for the  United States unk   Supercomputer was used with an',
 'The two years it was operated by the Network  Nano   Cache were',
 'Supercomputer Drone is part of the Emacs system  which has an Exception']

In [138]:
case_2_X_train_df.head()

Unnamed: 0,Software,Operator,I/O,lonely,Southbridge,noiseless,programming,Abstraction,Kilobyte,male,...,Johnny,batman,Italy,majestic,Telepresence,intensive,Comment,Variable,president,Emacs
0,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


#### Note that for Case 2 we do not clip, so the actual counts are kept

In [139]:
# Reuse the CountVectorizer fitted on the train set to vectorize the test set
case2_count_matrix_on_test_set = case_2_count_vectorizer.transform(test_df['text'].tolist())
case_2_test_count_array = case2_count_matrix_on_test_set.toarray()
case_2_X_test_df = pd.DataFrame(data=case_2_test_count_array, columns=case_2_count_vectorizer.get_feature_names_out())  # scikit-learn >= 1.0



In [140]:
case_2_X_test_df.head()

Unnamed: 0,Software,Operator,I/O,lonely,Southbridge,noiseless,programming,Abstraction,Kilobyte,male,...,Johnny,batman,Italy,majestic,Telepresence,intensive,Comment,Variable,president,Emacs
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
2,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [141]:
test_df[:5]['text'].tolist()

['The callsign Error Emacs  encode  has the  Un',
 'Analog I/O is assigned to the machine that uses a Kilo intensive',
 'Analog Number is the song for memory in Southbridge  which has a',
 'Chip Clock is a video game for RAM.',
 'The Motherboard Error  is a 2nd in the series of Boots']

### Step-3 **Modeling**

Convert the label column to integers: CS becomes 1 and Non-CS becomes 0.

In [142]:
train_df['label'].value_counts()

Non-CS    160
CS        160
Name: label, dtype: int64

In [144]:
# assign the result back instead of using inplace=True on a column selection
train_df['label'] = train_df['label'].map({'Non-CS': 0, 'CS': 1})
train_df['label'].value_counts()

0    160
1    160
Name: label, dtype: int64

In [145]:
test_df['label'].value_counts()

CS        40
Non-CS    40
Name: label, dtype: int64

In [146]:
test_df['label'] = test_df['label'].map({'Non-CS': 0, 'CS': 1})
test_df['label'].value_counts()

1    40
0    40
Name: label, dtype: int64

In [147]:
YTRAIN = train_df['label'].to_numpy()

In [148]:
YTEST = test_df['label'].to_numpy()

In [149]:
YTEST

array([1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,
       1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0,
       0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0])

#### We use Naive Bayes classification for this supervised problem, as required by the project problem statement

In [156]:
from sklearn.naive_bayes import GaussianNB
import sklearn.metrics as skmetrics
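
The notebook uses `GaussianNB`, which models each feature with a normal distribution. For binary presence features and for count features, scikit-learn also offers `BernoulliNB` and `MultinomialNB`, which are often the more natural Naive Bayes variants for such data. The sketch below is not the notebook's method; it shows those two alternatives on made-up toy data with four keyword columns.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Toy data (illustrative): columns 0-1 are CS keywords, columns 2-3 are non-CS keywords.
X_counts = np.array([
    [2, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 3, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 0],
    [0, 0, 0, 1],
])
y = np.array([1, 1, 1, 0, 0, 0])   # 1 = CS, 0 = Non-CS
X_binary = np.clip(X_counts, 0, 1)  # Case 1 style presence features

bernoulli_model = BernoulliNB().fit(X_binary, y)      # binary features
multinomial_model = MultinomialNB().fit(X_counts, y)  # count features

new_binary = np.array([[1, 0, 0, 0], [0, 0, 0, 1]])
new_counts = np.array([[3, 1, 0, 0], [0, 0, 1, 2]])

bernoulli_pred = bernoulli_model.predict(new_binary)
multinomial_pred = multinomial_model.predict(new_counts)
print(bernoulli_pred)    # [1 0]
print(multinomial_pred)  # [1 0]
```

Both models apply Laplace smoothing by default (`alpha=1.0`), so a keyword never seen in one class does not force a zero probability.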

### **Case 1: Modeling**

In [151]:
gnb_case1 = GaussianNB()

In [153]:
y_pred_case1 = gnb_case1.fit(case_1_X_train_df.to_numpy(), YTRAIN).predict(case_1_X_test_df.to_numpy())

In [154]:
print("Case1: Number of mislabeled points out of a total %d points : %d"
      % (case_1_X_test_df.shape[0], (YTEST != y_pred_case1).sum()))

Case1: Number of mislabeled points out of a total 80 points : 3


In [158]:
print(skmetrics.classification_report(YTEST, y_pred_case1))

              precision    recall  f1-score   support

           0       1.00      0.93      0.96        40
           1       0.93      1.00      0.96        40

    accuracy                           0.96        80
   macro avg       0.97      0.96      0.96        80
weighted avg       0.97      0.96      0.96        80



In [159]:
print(skmetrics.accuracy_score(YTEST, y_pred_case1))

0.9625
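
The figures in the report can be cross-checked by hand. With 3 errors out of 80, precision 1.00 for class 0 and recall 1.00 for class 1, the only consistent reading is that all three errors are Non-CS sentences predicted as CS. The counts below are derived from the report under that reading, not taken from a printed confusion matrix:

```python
# Counts implied by the Case 1 report (derived, see the paragraph above).
cs_predicted_cs = 40          # every CS test sentence is predicted CS
non_cs_predicted_cs = 3       # the three errors
non_cs_predicted_non_cs = 37
cs_predicted_non_cs = 0

total = (cs_predicted_cs + non_cs_predicted_cs
         + non_cs_predicted_non_cs + cs_predicted_non_cs)

accuracy = (cs_predicted_cs + non_cs_predicted_non_cs) / total
precision_cs = cs_predicted_cs / (cs_predicted_cs + non_cs_predicted_cs)
recall_non_cs = non_cs_predicted_non_cs / (non_cs_predicted_non_cs + non_cs_predicted_cs)

print(accuracy)       # 77 / 80
print(precision_cs)   # 40 / 43
print(recall_non_cs)  # 37 / 40
```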


### **Case 2: Modeling**

In [177]:
gnb_case2 = GaussianNB()

In [178]:
y_pred_case2 = gnb_case2.fit(case_2_X_train_df.to_numpy(), YTRAIN).predict(case_2_X_test_df.to_numpy())

In [179]:
print("Case2: Number of mislabeled points out of a total %d points : %d"
      % (case_2_X_test_df.shape[0], (YTEST != y_pred_case2).sum()))

Case2: Number of mislabeled points out of a total 80 points : 3


In [180]:
print(skmetrics.classification_report(YTEST, y_pred_case2))

              precision    recall  f1-score   support

           0       1.00      0.93      0.96        40
           1       0.93      1.00      0.96        40

    accuracy                           0.96        80
   macro avg       0.97      0.96      0.96        80
weighted avg       0.97      0.96      0.96        80



In [181]:
print(skmetrics.accuracy_score(YTEST, y_pred_case2))

0.9625


### **Step-4: Display results for 5 training sentences and 5 test sentences with their classification results**

### Case 1: Display results for 5 training sentences and 5 test sentences with their classification results

In [160]:
# 5 training sentences

In [164]:
pd.set_option('display.max_colwidth', None)

In [167]:
y_pred_case1_train = gnb_case1.predict(case_1_X_train_df.to_numpy())

In [189]:
train_df['case1_predicted'] = y_pred_case1_train  # assign the prediction array directly

In [190]:
train_df.head(5)

Unnamed: 0,text,label,case2_predicted,case1_predicted
0,The film Swarah accused is from Italy s lustulkar,0,0,0
1,The bowler of Colorado is male and has a variation known as superman.,0,0,0
2,The Unicode for the United States unk Supercomputer was used with an,1,1,1
3,The two years it was operated by the Network Nano Cache were,1,1,1
4,Supercomputer Drone is part of the Emacs system which has an Exception,1,1,1


In [191]:
# 5 test sentences

In [196]:
test_df['case1_predicted'] = y_pred_case1  # assign the prediction array directly

In [197]:
test_df.sample(5)

Unnamed: 0,text,label,case1_predicted
2,Analog Number is the song for memory in Southbridge which has a,1,1
78,The operating organization for a computer with the Unicode of ftare is called,1,1
29,The song unk Cache Mear Voice is an audio track of the album,1,1
22,The Emacs Clima like Telepresence system supports the cpu,1,1
34,The batsman is an example of a womanmadcap which has the soundless,0,0


### Case 2: Display results for 5 training sentences and 5 test sentences with their classification results

In [198]:
y_pred_case2_train = gnb_case2.predict(case_2_X_train_df.to_numpy())

In [199]:
train_df['case2_predicted'] = y_pred_case2_train  # assign the prediction array directly

In [200]:
# 5 train set predictions
train_df.head(5)

Unnamed: 0,text,label,case2_predicted,case1_predicted
0,The film Swarah accused is from Italy s lustulkar,0,0,0
1,The bowler of Colorado is male and has a variation known as superman.,0,0,0
2,The Unicode for the United States unk Supercomputer was used with an,1,1,1
3,The two years it was operated by the Network Nano Cache were,1,1,1
4,Supercomputer Drone is part of the Emacs system which has an Exception,1,1,1


In [201]:
test_df['case2_predicted'] = y_pred_case2  # assign the prediction array directly
test_df.sample(5)

Unnamed: 0,text,label,case1_predicted,case2_predicted
16,Parliment tiger which is in the state of California academe,0,0,0
40,The footballmonumental mimicked a politician s bowl while its male counterpart was,0,0,0
35,The memory of Error is a cache that uses the HW as an example.,1,1,1
70,The minister s parliment heart has a sinister in Colorado which,0,0,0
4,The Motherboard Error is a 2nd in the series of Boots,1,1,1
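
The assignment also asks for five hand-picked test sentences with their CS / non-CS classifications. One way to classify new raw sentences and get readable labels back is to chain the vectorizer and the classifier in a scikit-learn pipeline. The sketch below uses a tiny made-up vocabulary and training set, and `BernoulliNB` rather than the notebook's `GaussianNB`; all names in it are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy vocabulary and training sentences.
toy_vocabulary = ["Cache", "Emacs", "Memory", "tiger", "bowler", "umpire"]
toy_train_sentences = [
    "The Cache sits next to Memory",
    "Emacs loads the file into Memory",
    "The bowler appealed to the umpire",
    "A tiger chased the bowler",
]
toy_train_labels = ["CS", "CS", "Non-CS", "Non-CS"]

# Vectorizer and classifier travel together, so new text needs no manual steps.
model = make_pipeline(
    CountVectorizer(vocabulary=toy_vocabulary, lowercase=False, binary=True),
    BernoulliNB(),
)
model.fit(toy_train_sentences, toy_train_labels)

new_sentences = ["Emacs keeps the file in Memory", "The umpire watched the tiger"]
predicted_labels = model.predict(new_sentences)
for sentence, label in zip(new_sentences, predicted_labels):
    print(f"{label}: {sentence}")
```

Training on string labels keeps the output readable without a separate 0/1 mapping step.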


## Step - 5 Display the result of the algorithm when it is trained on CS data only and tested on non-CS data only

### Case 1: Display the result of the algorithm when it is trained on CS data only and tested on non-CS data only

In [202]:
cs_data_train = cs_df.copy()
non_cs_data_test = non_cs_df.copy()


In [203]:
case_1_count_vectorizer_all_cs_train = CountVectorizer(vocabulary=keywords_with_indices, lowercase=False)
case_1_count_matrix_all_cs_train = case_1_count_vectorizer_all_cs_train.fit_transform(cs_data_train['text'].tolist())

In [204]:
case_1_train_count_array_all_cs_train = case_1_count_matrix_all_cs_train.toarray()
# get_feature_names_out() needs scikit-learn >= 1.0
case_1_X_train_df_all_cs = pd.DataFrame(data=case_1_train_count_array_all_cs_train, columns=case_1_count_vectorizer_all_cs_train.get_feature_names_out())



In [205]:
case_1_X_train_df_all_cs.head()

Unnamed: 0,Software,Operator,I/O,lonely,Southbridge,noiseless,programming,Abstraction,Kilobyte,male,...,Johnny,batman,Italy,majestic,Telepresence,intensive,Comment,Variable,president,Emacs
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


In [206]:
cs_data_train[:5]['text'].tolist()

['Analog I/O is assigned to the machine that uses a Kilo intensive',
 'The Ram s base is in the Township of Syntax HW.',
 'Syntax is a two way video game with an algorithm of HW.',
 'I/O is the source of a program called Machine Bowlean  that',
 'The callsign Kilobyte  with the Unicode of  Shame  ']

In [208]:
case_1_X_train_df_all_cs.max().describe()

count    100.000000
mean       0.430000
std        0.517472
min        0.000000
25%        0.000000
50%        0.000000
75%        1.000000
max        2.000000
dtype: float64

In [209]:
case_1_X_train_df_all_cs.clip(upper=1, inplace=True)

In [211]:
case_1_X_train_df_all_cs.max().describe() # after clipping at 1, each column holds binary presence (0 or 1)

count    100.000000
mean       0.420000
std        0.496045
min        0.000000
25%        0.000000
50%        0.000000
75%        1.000000
max        1.000000
dtype: float64

In [214]:
# Reuse the CountVectorizer fitted on the CS-only train set to vectorize the non-CS test set
case1_count_matrix_on_test_set_all_noncs = case_1_count_vectorizer_all_cs_train.transform(non_cs_data_test['text'].tolist())
case_1_test_count_array_all_noncs = case1_count_matrix_on_test_set_all_noncs.toarray()
case_1_X_test_df_all_non_cs = pd.DataFrame(data=case_1_test_count_array_all_noncs, columns=case_1_count_vectorizer_all_cs_train.get_feature_names_out())  # scikit-learn >= 1.0



In [215]:
case_1_X_test_df_all_non_cs.head()

Unnamed: 0,Software,Operator,I/O,lonely,Southbridge,noiseless,programming,Abstraction,Kilobyte,male,...,Johnny,batman,Italy,majestic,Telepresence,intensive,Comment,Variable,president,Emacs
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [217]:
non_cs_data_test[:5]['text'].tolist()

['The moonbeam is superman  while the lustrous and unrelated to it ',
 'Invulnerable  lonely and shyly tendulkar is utah ',
 'The bowler of the word  lustrous  is umpire and sheadcap',
 'The bowler tiger  which has a tint of white and is otherwise known as',
 'Sachin s footballmonumental has a lackluster in bowler  which is']

In [223]:
cs_data_train['label'].value_counts()

CS    200
Name: label, dtype: int64

In [224]:
cs_data_train['label'] = cs_data_train['label'].map({'Non-CS': 0, 'CS': 1})
cs_data_train['label'].value_counts()

1    200
Name: label, dtype: int64

In [225]:
non_cs_data_test['label'].value_counts()

Non-CS    200
Name: label, dtype: int64

In [226]:
non_cs_data_test['label'] = non_cs_data_test['label'].map({'Non-CS': 0, 'CS': 1})
non_cs_data_test['label'].value_counts()

0    200
Name: label, dtype: int64

In [228]:
YTRAIN_all_cs = cs_data_train['label'].to_numpy()
YTEST_all_non_cs = non_cs_data_test['label'].to_numpy()

In [229]:
gnb_case1_all_cs_train = GaussianNB()
y_pred_case1_all_noncs_test = gnb_case1_all_cs_train.fit(case_1_X_train_df_all_cs.to_numpy(), YTRAIN_all_cs).predict(case_1_X_test_df_all_non_cs.to_numpy())

In [230]:
print("Case1: Number of mislabeled points out of a total %d points : %d"
      % (case_1_X_test_df_all_non_cs.shape[0], (YTEST_all_non_cs != y_pred_case1_all_noncs_test).sum()))

Case1: Number of mislabeled points out of a total 200 points : 200


In [231]:
print(skmetrics.classification_report(YTEST_all_non_cs, y_pred_case1_all_noncs_test))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00     200.0
           1       0.00      0.00      0.00       0.0

    accuracy                           0.00     200.0
   macro avg       0.00      0.00      0.00     200.0
weighted avg       0.00      0.00      0.00     200.0
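
This result is expected rather than a sign of a broken model: a classifier fitted on a single class can only ever predict that class, so every non-CS test sentence is labeled CS. A minimal sketch of the same effect on made-up data:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Training data contains only class 1 (CS); the values are illustrative.
X_train_only_cs = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 0]])
y_train_only_cs = np.array([1, 1, 1])

# Test data contains only class 0 (Non-CS).
X_test_only_non_cs = np.array([[0, 0, 1], [0, 0, 1]])
y_test_only_non_cs = np.array([0, 0])

single_class_model = GaussianNB().fit(X_train_only_cs, y_train_only_cs)
single_class_pred = single_class_model.predict(X_test_only_non_cs)

print(single_class_model.classes_)  # the model knows one class only
print(single_class_pred)            # so every prediction is that class
print((single_class_pred != y_test_only_non_cs).sum())  # all test rows are wrong
```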



