**DataMining Project Given**

In this project you will implement a solution to a text mining problem. The problem is to classify a
document (that is, a given amount of text) as belonging to either computer science (CS) or non-CS. (This
problem has applications in many areas, such as filtering email spam from non-spam.) For simplicity, our
“document” for this project will consist of just a single sentence or part of a sentence (e.g., a phrase or a
clause).


1) Decide on a number of keywords in advance, and then represent a document as a vector of 
those keyword counts, where each component of the vector is integer-valued, representing the count 
(frequency) of the corresponding keyword in the document. Use a supervised learning scenario. You will 
create (choose) your own training and test data (creating the data is an important exercise). 

2) Create data 
by using randomly chosen sentences from, for example, your textbooks, works of fiction, or news. Use 
the Naïve Bayes approach.  

Case I: The keyword counts in a document vector are each binary (1 or 0, representing the presence or 
absence of that keyword in the document). 

Case II: The keyword counts are non-negative integers (zero or more).

This project involves slightly extending one of the techniques discussed in class. Feel free to look up
material on the Internet, but the final code should be your own work.

Submit the source code, some (or all, if the training set is not huge) training data, five different test 
cases (sentences) and the corresponding classifications (CS or non-CS). Also, show the result 
(classifications) your algorithm produces when the training data (without the CS/non-CS label) are used 
as input (test data). 
All instructions on group work, acceptable formats for file upload, use of packages, etc. are as in the
previous projects.

### **Our Proposal: Project Solution Blueprint**

The solution involves the following steps:

* Data Preparation
    * Decide on keywords (scope vocabulary)
    * Generate random sentences/phrases using a subset of the keywords and annotate them; extract a dataset summary
    * Case 1: Text representation in the form of keyword presence (binary 0 or 1)
    * Case 2: Text representation in the form of keyword counts
    * Generate training and test datasets
* Perform supervised classification on the dataset
* Display results for 5 training sentences and 5 test sentences with the corresponding classification results
* Display the result of the algorithm when the training data (without the CS/non-CS labels) are used as test input
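
The two text representations named in the blueprint can be sketched with the standard library alone. This is a minimal illustration, not the notebook's implementation: the four-word vocabulary, the example sentence, and the helper names `count_vector` and `binary_vector` are assumptions made for this example.

```python
import re

# Hypothetical mini-vocabulary, for illustration only.
keywords = ["algorithm", "database", "tiger", "umpire"]

def count_vector(text, vocabulary):
    """Case 2: number of occurrences of each keyword in the text."""
    tokens = re.findall(r"\w+", text)
    return [tokens.count(word) for word in vocabulary]

def binary_vector(text, vocabulary):
    """Case 1: 1 if the keyword occurs at least once, else 0."""
    return [min(count, 1) for count in count_vector(text, vocabulary)]

sentence = "The database algorithm beats the other algorithm"
print(count_vector(sentence, keywords))   # [2, 1, 0, 0]
print(binary_vector(sentence, keywords))  # [1, 1, 0, 0]
```

The binary vector is simply the count vector capped at 1, which is the same relationship the notebook later relies on when it clips the Case 1 counts.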

## **Step-1: Data Preparation**

#### Step 1.1. Decide on keywords (Scope Vocabulary)

We use 100 keywords, so each document vector has 100 features: 50 CS words and 50 non-CS words.

The CS keywords are common computer science vocabulary taken from the following website. Reference: http://marvin.cs.uidaho.edu/Teaching/CS112/terms.pdf

In [346]:
computer_science_key_words = ['Abstraction', 'Cache', 'algorithm', 'Assignment', 'Syntax', 'Program', 'Nano-', 'Error', 'Thread', 'Variable', 'File', 
                              'Supercomputer', 'UNIX', 'I/O', 'database', 'Unicode', 'Source', 'sharing', 'Analog', 'Southbridge', 
                              'Kilo-', 'Boot', 'drive', 'Operator', 'Disk', 'Motherboard', 'Peta-', 'Exception', 'Clock', 
                              'Emacs', 'RAM', 'Comment', 'Memory', 'Base', 'IT', 'Nest', 'Boolean', 'Machine', 'HW', 'Software', 
                              'Time', 'programming', 'Call', 'Two', 'intensive', 'Kilobyte', 'Drone', 'Telepresence', 'Chip', 'Computing']
len(computer_science_key_words)

50

In [347]:
# NOTE: there is no comma after "woman" or after "football", so Python joins each with the
# next literal ("womanmadcap", "footballmonumental"). The list therefore holds 50 entries,
# and these two merged tokens appear as feature names in the outputs below.
non_cs_keywords = ["academe", "accused", "amazement", "Sachin", "Johnny", "Movie", "batman", "spiderman", "superman", "tendulkar", "william", "shaksphere", "playing", "smoking", "chicken", "elephant",
"impartial", "invulnerable", "lackluster", "tiger", "lion", "parliment", "president", "minister", "dancing", "eating", "drinking",
"laughable", "lonely", "lustrous", "batsman", "bowler", "umpire", "female", "male", "man", "woman"
"madcap","majestic","mimic", "utah", "Colorado", "California", "Italy", "heaven", "france", "paris", "england", "football"
"monumental","moonbeam","noiseless"]
len(non_cs_keywords)

50
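
The cell above contains 52 quoted strings, yet `len` reports 50. The difference comes from Python's implicit concatenation of adjacent string literals: `"woman"` and `"football"` are not followed by a comma, so each merges with the next literal. A small sketch of the effect, using a shortened list made up for this example:

```python
# Adjacent string literals are joined when no comma separates them.
with_missing_comma = ["man", "woman"
                      "madcap", "majestic"]
with_comma = ["man", "woman",
              "madcap", "majestic"]

print(with_missing_comma)       # ['man', 'womanmadcap', 'majestic']
print(len(with_missing_comma))  # 3
print(len(with_comma))          # 4
```

This is why `womanmadcap` and `footballmonumental` show up as single keywords in the generated sentences and feature columns later on.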

In [348]:
all_key_words_vocabulary = computer_science_key_words + non_cs_keywords
len(set(all_key_words_vocabulary)), len(all_key_words_vocabulary)

(100, 100)

#### Step 1.2 Generate Random Sentences using these keywords and annotate them. Extract Summary

- To generate random sentences, we use a neural network language model (a T5 transformer) through the existing `keytotext` library.

- We generate 200 sentences each from the CS keywords and the Non-CS keywords and label them 'CS' and 'Non-CS'.

In [None]:
# install keytotext  Reference: https://github.com/gagan3012/keytotext
#!pip install keytotext

In [None]:
from keytotext import pipeline

Global seed set to 42


In [None]:
nlp = pipeline("k2t-base")


In [None]:
params = {"do_sample":False, "num_beams":2, "no_repeat_ngram_size":1, "early_stopping":True}

In [None]:
import pandas as pd
import random
cs_df = pd.DataFrame(columns=['text', 'label'])
non_cs_df = pd.DataFrame(columns=['text', 'label'])

#### Step 1.3 Generate 200 CS Samples

In [None]:
# generate 200 CS samples
for i in range(200):
  # pick 10 random computer science keywords and make a sentence with them
  target_words = random.sample(computer_science_key_words, 10)
  text_i = nlp(target_words, **params).replace("|", " ").replace('[/<unk>"', ' ')
  cs_df.loc[i, 'text'] = text_i
  cs_df.loc[i, 'label'] = 'CS'
  if i!=0 and i%100 == 0:
    print(f"Generated {i} sentences")

Generated 100 sentences


In [None]:
cs_df.shape

(200, 2)

In [136]:
cs_df.head()

Unnamed: 0,text,label
0,Analog I/O is assigned to the machine that uses a Kilo intensive,CS
1,The Ram s base is in the Township of Syntax HW.,CS
2,Syntax is a two way video game with an algorithm of HW.,CS
3,I/O is the source of a program called Machine Bowlean that,CS
4,The callsign Kilobyte with the Unicode of Shame,CS


#### Step 1.4: Add some noise to the generated CS sentences to obtain some repetition as per our project use case


In [243]:
import re
for idx, row in cs_df.iterrows():
  choices = [1, 2, 3, 4]
  selected_choice = random.sample(choices, 1)[0]
  if selected_choice == 1:
    # repeat the sentence so that keyword counts exceed 1
    text_to_write = row['text'] + " " + row["text"]
  elif selected_choice == 2:
    # add noise from non-CS keywords
    text_to_write = row['text'] + " " + " ".join(random.sample(non_cs_keywords, 5))
  elif selected_choice == 3:
    # repeat the sentence without its last 20 characters
    text_to_write = row['text'] + " " + row["text"][:-20]
  else:
    # add noise from non-CS keywords in front and extra CS keywords at the end
    text_to_write = " ".join(random.sample(non_cs_keywords, 5)) + " " + row['text'] + " " + " ".join(random.sample(computer_science_key_words, 5))
  # replace every character other than letters, digits, spaces, newlines and periods
  # with a space (the raw string avoids the invalid "\." escape)
  text_to_write = re.sub(r'[^a-zA-Z0-9 \n.]', ' ', text_to_write)
  # the substitution above turns "I/O" into "I O"; restore it
  text_to_write = text_to_write.replace("I O", "I/O")
  cs_df.loc[idx, 'text'] = text_to_write
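
The four noise strategies can also be written as one reusable function. The sketch below is only an illustration under stated assumptions: the function name `add_noise`, its parameters, the three-word keyword subsets, and the seeded `random.Random` instance are all hypothetical and not part of the notebook.

```python
import random

def add_noise(text, own_keywords, other_keywords, rng, k=2):
    """Return `text` perturbed by one of four randomly chosen strategies."""
    choice = rng.choice([1, 2, 3, 4])
    if choice == 1:
        # Repeat the sentence so keyword counts exceed 1.
        return text + " " + text
    if choice == 2:
        # Append keywords from the opposite class.
        return text + " " + " ".join(rng.sample(other_keywords, k))
    if choice == 3:
        # Repeat the sentence with its last 20 characters cut off.
        return text + " " + text[:-20]
    # Prefix opposite-class keywords and append same-class keywords.
    return (" ".join(rng.sample(other_keywords, k)) + " " + text + " "
            + " ".join(rng.sample(own_keywords, k)))

rng = random.Random(0)  # seeded so the sketch is reproducible
cs_words = ["Cache", "Thread", "UNIX"]       # hypothetical subsets
non_cs_words = ["tiger", "umpire", "paris"]
noisy = add_noise("Cache is part of the UNIX system", cs_words, non_cs_words, rng)
print(noisy)
```

Passing the two keyword lists in swapped order would give the Non-CS variant of the same procedure, so a single function could serve both classes.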

In [244]:
cs_df[:5]['text'].tolist()

['Clock Kilo  Kilobyte Abstraction Operator Analog I/O is assigned to the machine that uses a Kilo intensive IT Boot Abstraction Chip Exception Clock Kilo  Kilobyte Abstraction Operator Analog I/O is assigned to the machine that uses a Kilo intensive IT Boot Abstra',
 'IT Call sharing Peta  Telepresence The Ram s base is in the Township of Syntax HW. Motherboard Unicode Nano  Peta  IT william majestic Johnny paris male',
 'impartial superman heaven lackluster smoking Syntax is a two way video game with an algorithm of HW. Syntax is a two way video game with UNIX Supercomputer Clock Assignment Chip',
 'Abstraction Boot Boolean Supercomputer Southbridge I/O is the source of a program called Machine Bowlean  that Time RAM Peta  Drone Assignment Abstraction Boot Boolean Supercomputer Southbridge I/O is the source of a program called Machine Bowlean  that Time RAM Peta  Drone Assignment',
 'The callsign Kilobyte  with the Unicode of  Shame   The callsign Kilobyte  with the Unicode of  Shame

In [342]:
cs_df.to_csv("cs_df.csv", index=False) # save to local drive

#### Step 1.5 Generate Non-CS Samples

In [None]:
# generate 200 Non-CS samples
for i in range(200):
  # pick 10 random non-CS keywords and make a sentence with them
  target_words = random.sample(non_cs_keywords, 10)
  text_i = nlp(target_words, **params).replace("|", " ").replace('[/<unk>"', ' ')
  non_cs_df.loc[i, 'text'] = text_i
  non_cs_df.loc[i, 'label'] = 'Non-CS'
  if i!=0 and i%100 == 0:
    print(f"Generated {i} sentences")

Generated 100 sentences


In [None]:
non_cs_df.shape

(200, 2)

In [143]:
non_cs_df[:5]['text'].tolist()

['The moonbeam is superman  while the lustrous and unrelated to it ',
 'Invulnerable  lonely and shyly tendulkar is utah ',
 'The bowler of the word  lustrous  is umpire and sheadcap',
 'The bowler tiger  which has a tint of white and is otherwise known as',
 'Sachin s footballmonumental has a lackluster in bowler  which is']

#### Step 1.6: Add some noise to the generated Non-CS sentences to match our project requirement

In [245]:
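import re
for idx, row in non_cs_df.iterrows():
  choices = [1, 2, 3, 4]
  selected_choice = random.sample(choices, 1)[0]
  if selected_choice == 1:
    # repeat the sentence so that keyword counts exceed 1
    text_to_write = row['text'] + " " + row["text"]
  elif selected_choice == 2:
    # add noise from CS keywords
    text_to_write = row['text'] + " " + " ".join(random.sample(computer_science_key_words, 5))
  elif selected_choice == 3:
    # repeat the sentence without its last 20 characters
    text_to_write = row['text'] + " " + row["text"][:-20]
  else:
    # add noise from CS keywords in front and extra non-CS keywords at the end
    text_to_write = " ".join(random.sample(computer_science_key_words, 5)) + " " + row['text'] + " " + " ".join(random.sample(non_cs_keywords, 5))
  # replace every character other than letters, digits, spaces, newlines and periods
  # with a space (the raw string avoids the invalid "\." escape)
  text_to_write = re.sub(r'[^a-zA-Z0-9 \n.]', ' ', text_to_write)
  # the substitution above turns "I/O" into "I O"; restore it
  text_to_write = text_to_write.replace("I O", "I/O")
  non_cs_df.loc[idx, 'text'] = text_to_write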

In [246]:
non_cs_df[:5]['text'].tolist()

['The moonbeam is superman  while the lustrous and unrelated to it  The moonbeam is superman  while the lustrous and unrelated to it  The moonbeam is superman  while the lustrous and unrelated to it  The moonbeam is superman  while the lustrous ',
 'Invulnerable  lonely and shyly tendulkar is utah  Invulnerable  lonely and shyly tendulkar is utah  HW Drone drive Software Motherboard',
 'The bowler of the word  lustrous  is umpire and sheadcap The bowler of the word  lustrous  is umpire and sheadcap The bowler of the word  lustrous  is umpire and sheadcap The bowler of the word  lustrous  is',
 'The bowler tiger  which has a tint of white and is otherwise known as utah umpire playing lion California The bowler tiger  which has a tint of white and is otherwise known as utah umpire playing lion California',
 'Sachin s footballmonumental has a lackluster in bowler  which is Sachin s footballmonumental has a lackluster in bowler  which is Sachin s footballmonumental has a lackluster in bowl

In [343]:
non_cs_df.to_csv("non_cs_df.csv", index=False)

In [247]:
# Concat (Merge) the CS records and Non-CS records
dataset_df = pd.concat([cs_df, non_cs_df], ignore_index=True)
dataset_df = dataset_df.sample(frac=1, random_state=42)
dataset_df.reset_index(drop=True, inplace=True)
print(dataset_df.shape)
dataset_df.head()

(400, 2)


Unnamed: 0,text,label
0,Umpire was amazed by the moonbeam. invulnerable noiseless amazement Movie elephant Umpire was amazed by the moonbeam. invulnerable noiseless amazement Movie elephant,Non-CS
1,database UNIX algorithm Error Kilobyte amazement parliment California england shaksphere The movie Utah with the lackluster of angry is invul batsman heaven Sachin tendulkar impartial chicken Movie Sachin drinking batman,Non-CS
2,elephant heaven utah majestic womanmadcap The Clock Unicoded for the program is part of Analog. The Clock Unicoded for the progra Software Drone Syntax Nest programming,CS
3,Emacs UNIX Base RAM Memory batman Johnny batsman female shaksphere A Tiger is a dish that can be found in the UTAh ut bowler majestic Sachin batman Johnny noiseless lackluster paris man parliment,Non-CS
4,Emacs is a source of Syntax and the abbreviation Abtraction Emacs is a source of Syntax and the abbr Emacs is a source of Syntax and the abbreviation Abtraction Emacs is a source of Syntax and the abbr,CS


In [248]:
# validate distribution = Balanced dataset
dataset_df.label.value_counts()

Non-CS    200
CS        200
Name: label, dtype: int64

In [345]:
dataset_df.to_csv("full_dataset.csv", index=False)

#### Step 1.7  **Prepare Keywords Vocabulary to be used in Vectorization**

In [249]:
# shuffle the vocabulary
random.Random(42).shuffle(all_key_words_vocabulary)

In [250]:
keywords_with_indices = {word: idx for idx, word in enumerate(all_key_words_vocabulary)}

In [251]:
# print the first 6 entries to check
for idx, (key, val) in enumerate(keywords_with_indices.items()):
  print(f"{key}-{val}")
  if idx == 5:
    break

UNIX-0
england-1
batman-2
male-3
Machine-4
bowler-5


#### Step 1.8: **Split Train and Test set**

In [252]:
# 80-20 train/test split, taken per class so that both sets stay balanced
index_split = int(0.2 * 200)
index_split

40

In [253]:
train_cs_df = cs_df[index_split:]
test_cs_df = cs_df[:index_split]
print(train_cs_df.shape, test_cs_df.shape)

(160, 2) (40, 2)


In [254]:
train_non_cs_df = non_cs_df[index_split:]
test_non_cs_df = non_cs_df[:index_split]
print(train_non_cs_df.shape, test_non_cs_df.shape)

(160, 2) (40, 2)


In [255]:
train_df = pd.concat([train_cs_df, train_non_cs_df], ignore_index=True).sample(frac=1, random_state=42)
train_df.reset_index(drop=True, inplace=True)
train_df.shape

(320, 2)

In [256]:
train_df.label.value_counts()

Non-CS    160
CS        160
Name: label, dtype: int64

In [257]:
train_df.head()

Unnamed: 0,text,label
0,The film Swarah accused is from Italy s lustulkar impartial paris majestic tiger smoking The film Swarah accused is from Italy s lustulkar impartial paris ma,Non-CS
1,The bowler of Colorado is male and has a variation known as superman. The bowler of Colorado is male and has a variatio The bowler of Colorado is male and has a variation known as superman. The bowler of Colorado is male and has a variatio,Non-CS
2,heaven playing Colorado invulnerable utah The Unicode for the United States unk Supercomputer was used with an The Unicode for the United States unk Supercomputer was used with an Cache Machine Thread Program Boolean,CS
3,The two years it was operated by the Network Nano Cache were The two years it was operated by the Network Nano Cache were utah drinking spiderman lonely batsman,CS
4,California umpire academe smoking male Supercomputer Drone is part of the Emacs system which has an Exception Supercomputer Drone is part of the Emacs system wh Time Variable Nest Analog Disk,CS


In [258]:
test_df = pd.concat([test_cs_df, test_non_cs_df], ignore_index=True).sample(frac=1, random_state=42)
test_df.reset_index(drop=True, inplace=True)
test_df.shape

(80, 2)

In [259]:
test_df.label.value_counts()

CS        40
Non-CS    40
Name: label, dtype: int64

In [260]:
test_df.tail(5)

Unnamed: 0,text,label
75,Disk Boolean s UNIX has an absolute support for Telepres Disk Boolean s UNIX has an absolute Disk Boolean s UNIX has an absolute support for Telepres Disk Boolean s U,CS
76,Supercomputer Program Operator Boolean Chip Italy paris utah chicken lustrous The president of the country is called Man s Man and it has a male elephant impartial moonbeam mimic womanmadcap france accused Johnny playing Movie,Non-CS
77,Memory Analog Drone Two Thread impartial invulnerable Colorado france playing The bowling match for the American actor who is otherwise known as umpire s heaven eating shaksphere Italy chicken male umpire playing paris Sachin,Non-CS
78,The operating organization for a computer with the Unicode of ftare is called The operating organization for a computer with the Unicode The operating organization for a computer with the Unicode of ftare is called The operating organization for a computer with the Unicode,CS
79,Tiger is the accused for shooting in a Dramatic Umpire. Tiger is the accused for shooting in a Dramatic Umpire. Boolean Analog Computing Thread Memory,Non-CS
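
The per-class 80/20 split used above can be summarised without pandas. This is a minimal sketch under stated assumptions: the placeholder rows, the helper name `split_per_class`, and the use of a seeded `random.Random` for shuffling are illustrative choices, not the notebook's code.

```python
import random

def split_per_class(rows, test_fraction=0.2):
    """Hold out the first `test_fraction` of one class's rows as test data."""
    cut = int(test_fraction * len(rows))
    return rows[cut:], rows[:cut]  # (train, test)

# Placeholder rows standing in for the 200 + 200 generated sentences.
cs_rows = [(f"cs sentence {i}", "CS") for i in range(200)]
non_cs_rows = [(f"non-cs sentence {i}", "Non-CS") for i in range(200)]

train_cs, test_cs = split_per_class(cs_rows)
train_non_cs, test_non_cs = split_per_class(non_cs_rows)

# Merge the classes and shuffle so the labels are interleaved.
rng = random.Random(42)
train_rows = train_cs + train_non_cs
test_rows = test_cs + test_non_cs
rng.shuffle(train_rows)
rng.shuffle(test_rows)

print(len(train_rows), len(test_rows))  # 320 80
```

Splitting each class separately before merging is what keeps both sets balanced at 160/160 and 40/40.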


#### Step 1.9: **Case 1: Prepare Text Representation in the form of keyword presence (binary 0 or 1)**

In [261]:
from sklearn.feature_extraction.text import CountVectorizer

In [262]:
case_1_count_vectorizer = CountVectorizer(vocabulary=keywords_with_indices, lowercase=False)

In [263]:
case1_count_matrix_on_train_set = case_1_count_vectorizer.fit_transform(train_df['text'].tolist())

In [264]:
case_1_train_count_array = case1_count_matrix_on_train_set.toarray()
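# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available since 1.0
case_1_X_train_df = pd.DataFrame(data=case_1_train_count_array, columns=case_1_count_vectorizer.get_feature_names_out())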



In [265]:
train_df[0:5]['text'].tolist()

['The film  Swarah  accused  is from Italy s lustulkar impartial paris majestic tiger smoking The film  Swarah  accused  is from Italy s lustulkar impartial paris ma',
 'The bowler of Colorado is male and has a variation known as superman. The bowler of Colorado is male and has a variatio The bowler of Colorado is male and has a variation known as superman. The bowler of Colorado is male and has a variatio',
 'heaven playing Colorado invulnerable utah The Unicode for the  United States unk   Supercomputer was used with an The Unicode for the  United States unk   Supercomputer was used with an Cache Machine Thread Program Boolean',
 'The two years it was operated by the Network  Nano   Cache were The two years it was operated by the Network  Nano   Cache were utah drinking spiderman lonely batsman',
 'California umpire academe smoking male Supercomputer Drone is part of the Emacs system  which has an Exception Supercomputer Drone is part of the Emacs system  wh Time Variable Nest Analog

In [266]:
case_1_X_train_df.head()

Unnamed: 0,UNIX,england,batman,male,Machine,bowler,Operator,Call,impartial,footballmonumental,...,Source,superman,sharing,eating,william,Analog,Telepresence,lonely,spiderman,Base
0,0,0,0,0,0,0,0,0,2,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,4,0,4,0,0,0,0,...,0,2,0,0,0,0,0,0,0,0
2,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,1,0
4,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0


In [267]:
case_1_X_train_df.max().describe()

count    100.000000
mean       3.300000
std        1.058873
min        0.000000
25%        2.750000
50%        4.000000
75%        4.000000
max        4.000000
dtype: float64

#### **Note:** `CountVectorizer` captures keyword counts, but Case 1 of the project needs only keyword presence as a binary value (1/0). We use the pandas `clip` function to cap every count greater than 1 at 1.

In [268]:
case_1_X_train_df.clip(upper=1, inplace=True)

In [269]:
case_1_X_train_df.max().describe()

count    100.000000
mean       0.960000
std        0.196946
min        0.000000
25%        1.000000
50%        1.000000
75%        1.000000
max        1.000000
dtype: float64

##### **Note**: As seen above, all values greater than 1 have been clipped to 1.
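
The clipped summary has a mean of 0.96 and a minimum of 0, which means four of the 100 keywords never occur in the training vectors. One plausible explanation, stated here as an assumption rather than something the notebook verifies, is tokenization: the default `CountVectorizer` token pattern keeps only runs of two or more word characters, so vocabulary entries such as `Nano-`, `Kilo-`, `Peta-` and `I/O` can never match a token. The sketch below emulates that pattern with the standard `re` module instead of calling scikit-learn; the helper `keyword_counts` and the sample text are made up for this example.

```python
import re

# Default token pattern documented for scikit-learn's CountVectorizer.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def keyword_counts(text, vocabulary):
    """Count exact, case-sensitive matches between tokens and keywords."""
    tokens = TOKEN_PATTERN.findall(text)
    return {word: tokens.count(word) for word in vocabulary}

vocabulary = ["UNIX", "Kilo-", "I/O", "IT"]
text = "Kilo- Kilo I/O IT UNIX UNIX"

counts = keyword_counts(text, vocabulary)
print(counts)   # {'UNIX': 2, 'Kilo-': 0, 'I/O': 0, 'IT': 1}

# Case 1 clipping: every count above 1 becomes 1.
clipped = {word: min(count, 1) for word, count in counts.items()}
print(clipped)  # {'UNIX': 1, 'Kilo-': 0, 'I/O': 0, 'IT': 1}
```

If those four keywords are meant to contribute, the vocabulary entries would need to be written the way the tokenizer sees them (for example `Kilo`), or a custom token pattern would have to be supplied.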

In [270]:
# Use the CountVectorizer object trained on train set onto test set to vectorize test set
# also clip the values to 1 since we are working on case 1

case1_count_matrix_on_test_set = case_1_count_vectorizer.transform(test_df['text'].tolist())
case_1_test_count_array = case1_count_matrix_on_test_set.toarray()
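# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available since 1.0
case_1_X_test_df = pd.DataFrame(data=case_1_test_count_array, columns=case_1_count_vectorizer.get_feature_names_out())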



In [271]:
case_1_X_test_df.head()

Unnamed: 0,UNIX,england,batman,male,Machine,bowler,Operator,Call,impartial,footballmonumental,...,Source,superman,sharing,eating,william,Analog,Telepresence,lonely,spiderman,Base
0,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,0,0,0,0,0,0,2,0,0,0,...,0,0,0,0,0,2,0,0,0,0
2,0,0,0,0,2,0,0,0,0,0,...,2,0,2,0,0,2,0,0,0,0
3,1,0,0,1,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


In [272]:
test_df['text'].tolist()[:4]

['The callsign Error Emacs  encode  has the  Un algorithm programming Kilo  sharing Time The callsign Error Emacs  encode  has the  Un algorithm programmin',
 'Clock Kilo  Kilobyte Abstraction Operator Analog I/O is assigned to the machine that uses a Kilo intensive IT Boot Abstraction Chip Exception Clock Kilo  Kilobyte Abstraction Operator Analog I/O is assigned to the machine that uses a Kilo intensive IT Boot Abstra',
 'Source Exception Machine Boolean sharing Analog Number is the song for memory in Southbridge  which has a programming Variable Drone Abstraction database Source Exception Machine Boolean sharing Analog Number is the song for memory in Southbridge  which has a programming Variable Drone Abstraction database',
 'man female amazement male parliment Chip Clock is a video game for RAM. Chip Clock is a drive HW sharing Two UNIX']

In [273]:
case_1_X_test_df.max().describe()


count    100.000000
mean       2.520000
std        1.158892
min        0.000000
25%        2.000000
50%        2.000000
75%        4.000000
max        6.000000
dtype: float64

In [274]:
case_1_X_test_df.clip(upper=1, inplace=True)

In [275]:
case_1_X_test_df.head()

Unnamed: 0,UNIX,england,batman,male,Machine,bowler,Operator,Call,impartial,footballmonumental,...,Source,superman,sharing,eating,william,Analog,Telepresence,lonely,spiderman,Base
0,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
2,0,0,0,0,1,0,0,0,0,0,...,1,0,1,0,0,1,0,0,0,0
3,1,0,0,1,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


In [276]:
case_1_X_test_df.max().describe()


count    100.000000
mean       0.960000
std        0.196946
min        0.000000
25%        1.000000
50%        1.000000
75%        1.000000
max        1.000000
dtype: float64

#### Step 1.10:  **Case 2: Prepare text representation in the form of keyword counts**

In [277]:
from sklearn.feature_extraction.text import CountVectorizer

In [278]:
case_2_count_vectorizer = CountVectorizer(vocabulary=keywords_with_indices, lowercase=False)
case2_count_matrix_on_train_set = case_2_count_vectorizer.fit_transform(train_df['text'].tolist())

In [279]:
case_2_train_count_array = case2_count_matrix_on_train_set.toarray()
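# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available since 1.0
case_2_X_train_df = pd.DataFrame(data=case_2_train_count_array, columns=case_2_count_vectorizer.get_feature_names_out())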



In [280]:
train_df[0:5]['text'].tolist()

['The film  Swarah  accused  is from Italy s lustulkar impartial paris majestic tiger smoking The film  Swarah  accused  is from Italy s lustulkar impartial paris ma',
 'The bowler of Colorado is male and has a variation known as superman. The bowler of Colorado is male and has a variatio The bowler of Colorado is male and has a variation known as superman. The bowler of Colorado is male and has a variatio',
 'heaven playing Colorado invulnerable utah The Unicode for the  United States unk   Supercomputer was used with an The Unicode for the  United States unk   Supercomputer was used with an Cache Machine Thread Program Boolean',
 'The two years it was operated by the Network  Nano   Cache were The two years it was operated by the Network  Nano   Cache were utah drinking spiderman lonely batsman',
 'California umpire academe smoking male Supercomputer Drone is part of the Emacs system  which has an Exception Supercomputer Drone is part of the Emacs system  wh Time Variable Nest Analog

In [281]:
case_2_X_train_df.head()

Unnamed: 0,UNIX,england,batman,male,Machine,bowler,Operator,Call,impartial,footballmonumental,...,Source,superman,sharing,eating,william,Analog,Telepresence,lonely,spiderman,Base
0,0,0,0,0,0,0,0,0,2,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,4,0,4,0,0,0,0,...,0,2,0,0,0,0,0,0,0,0
2,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,1,0
4,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0


#### **Note**: For Case 2 we do not clip, so the actual keyword counts are kept.

In [282]:
# Use the CountVectorizer object trained on train set onto test set to vectorize test set
case2_count_matrix_on_test_set = case_2_count_vectorizer.transform(test_df['text'].tolist())
case_2_test_count_array = case2_count_matrix_on_test_set.toarray()
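# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available since 1.0
case_2_X_test_df = pd.DataFrame(data=case_2_test_count_array, columns=case_2_count_vectorizer.get_feature_names_out())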



In [283]:
case_2_X_test_df.head()

Unnamed: 0,UNIX,england,batman,male,Machine,bowler,Operator,Call,impartial,footballmonumental,...,Source,superman,sharing,eating,william,Analog,Telepresence,lonely,spiderman,Base
0,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,0,0,0,0,0,0,2,0,0,0,...,0,0,0,0,0,2,0,0,0,0
2,0,0,0,0,2,0,0,0,0,0,...,2,0,2,0,0,2,0,0,0,0
3,1,0,0,1,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


In [284]:
test_df[:5]['text'].tolist()

['The callsign Error Emacs  encode  has the  Un algorithm programming Kilo  sharing Time The callsign Error Emacs  encode  has the  Un algorithm programmin',
 'Clock Kilo  Kilobyte Abstraction Operator Analog I/O is assigned to the machine that uses a Kilo intensive IT Boot Abstraction Chip Exception Clock Kilo  Kilobyte Abstraction Operator Analog I/O is assigned to the machine that uses a Kilo intensive IT Boot Abstra',
 'Source Exception Machine Boolean sharing Analog Number is the song for memory in Southbridge  which has a programming Variable Drone Abstraction database Source Exception Machine Boolean sharing Analog Number is the song for memory in Southbridge  which has a programming Variable Drone Abstraction database',
 'man female amazement male parliment Chip Clock is a video game for RAM. Chip Clock is a drive HW sharing Two UNIX',
 'The Motherboard Error  is a 2nd in the series of Boots Nano  Syntax algorithm Program Chip Colorado California laughable footballmonumental sh

## **Step - 2 Modeling**

Convert the label column to 1 (CS) and 0 (Non-CS) for the model training task.

In [285]:
train_df['label'].value_counts()

Non-CS    160
CS        160
Name: label, dtype: int64

In [286]:
# assign the result back instead of using inplace=True on a column selection
train_df['label'] = train_df['label'].replace({'Non-CS': 0, 'CS': 1}).astype(int)
train_df['label'].value_counts()

0    160
1    160
Name: label, dtype: int64

In [287]:
test_df['label'].value_counts()

CS        40
Non-CS    40
Name: label, dtype: int64

In [288]:
# assign the result back instead of using inplace=True on a column selection
test_df['label'] = test_df['label'].replace({'Non-CS': 0, 'CS': 1}).astype(int)
test_df['label'].value_counts()

1    40
0    40
Name: label, dtype: int64

In [289]:
YTRAIN = train_df['label'].to_numpy()

In [290]:
YTEST = test_df['label'].to_numpy()

In [291]:
YTEST

array([1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,
       1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0,
       0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0])

#### **Note**: We use Naive Bayes classification (scikit-learn's `GaussianNB`) for this supervised problem, as required by the project problem statement.

In [292]:
from sklearn.naive_bayes import GaussianNB
import sklearn.metrics as skmetrics

#### Step 2.1 **Case 1: Modeling**

In [293]:
gnb_case1 = GaussianNB()

In [294]:
y_pred_case1 = gnb_case1.fit(case_1_X_train_df.to_numpy(), YTRAIN).predict(case_1_X_test_df.to_numpy())

In [295]:
print("Case1: Number of mislabeled points out of a total %d points : %d"
      % (case_1_X_test_df.shape[0], (YTEST != y_pred_case1).sum()))

Case1: Number of mislabeled points out of a total 80 points : 14


In [296]:
print(skmetrics.classification_report(YTEST, y_pred_case1))

              precision    recall  f1-score   support

           0       0.81      0.85      0.83        40
           1       0.84      0.80      0.82        40

    accuracy                           0.82        80
   macro avg       0.83      0.82      0.82        80
weighted avg       0.83      0.82      0.82        80



In [297]:
print(skmetrics.accuracy_score(YTEST, y_pred_case1))

0.825
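
The figures in the Case 1 report can be re-derived by hand. The confusion counts below are not printed by the notebook; they are inferred from the reported recalls (0.80 for CS and 0.85 for Non-CS, each over 40 sentences) and agree with the 14 mislabeled points. The helper name `report` is an assumption for this sketch.

```python
def report(tp, fp, fn, tn):
    """Precision, recall and F1 for the positive class, plus accuracy."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Inferred counts with CS as the positive class:
# 32 of 40 CS sentences and 34 of 40 Non-CS sentences classified correctly.
precision, recall, f1, accuracy = report(tp=32, fp=6, fn=8, tn=34)
print(round(precision, 2), round(recall, 2), round(f1, 2), accuracy)
# 0.84 0.8 0.82 0.825
```

The 6 + 8 = 14 errors out of 80 give the accuracy 1 - 14/80 = 0.825 shown above.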


#### Step 2.2 **Case 2: Modeling**

In [298]:
gnb_case2 = GaussianNB()

In [299]:
y_pred_case2 = gnb_case2.fit(case_2_X_train_df.to_numpy(), YTRAIN).predict(case_2_X_test_df.to_numpy())

In [300]:
print("Case2: Number of mislabeled points out of a total %d points : %d"
      % (case_2_X_test_df.shape[0], (YTEST != y_pred_case2).sum()))

Case2: Number of mislabeled points out of a total 80 points : 3


In [301]:
print(skmetrics.classification_report(YTEST, y_pred_case2))

              precision    recall  f1-score   support

           0       0.95      0.97      0.96        40
           1       0.97      0.95      0.96        40

    accuracy                           0.96        80
   macro avg       0.96      0.96      0.96        80
weighted avg       0.96      0.96      0.96        80



In [302]:
print(skmetrics.accuracy_score(YTEST, y_pred_case2))

0.9625


### **Insight**: The Case 1 accuracy is 82.5%, whereas the Case 2 accuracy is 96.25%. On this dataset, the binary (presence) representation is therefore less effective than the frequency (count) representation.
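
To see why counts can change a decision, here is a from-scratch sketch of two Naive Bayes variants that are often paired with these representations: a Bernoulli model for presence/absence and a multinomial model for counts. This is not what the notebook runs (it uses `GaussianNB` for both cases). The toy vectors, the four hypothetical keywords, the function names `train` and `predict`, and the Laplace smoothing parameter `alpha` are all assumptions made for this example.

```python
import math

def train(X, y, alpha=1.0):
    """Estimate class priors and Laplace-smoothed keyword probabilities."""
    n_features = len(X[0])
    model = {}
    for label in set(y):
        rows = [x for x, target in zip(X, y) if target == label]
        present = [sum(1 for r in rows if r[j] > 0) for j in range(n_features)]
        totals = [sum(r[j] for r in rows) for j in range(n_features)]
        model[label] = {
            "log_prior": math.log(len(rows) / len(X)),
            # Bernoulli: P(keyword present | class)
            "p_present": [(c + alpha) / (len(rows) + 2 * alpha) for c in present],
            # Multinomial: P(a keyword occurrence is this keyword | class)
            "p_word": [(t + alpha) / (sum(totals) + alpha * n_features) for t in totals],
        }
    return model

def predict(model, x, binary):
    """Return the class with the highest log-posterior for vector x."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for j, value in enumerate(x):
            if binary:
                p = params["p_present"][j]
                score += math.log(p if value > 0 else 1.0 - p)
            else:
                score += value * math.log(params["p_word"][j])
        scores[label] = score
    return max(scores, key=scores.get)

# Toy training vectors over the hypothetical keywords
# [algorithm, database, tiger, umpire].
X = [[2, 1, 0, 0], [1, 0, 0, 1], [0, 0, 3, 1], [0, 1, 1, 2]]
y = ["CS", "CS", "Non-CS", "Non-CS"]
model = train(X, y)

# "algorithm tiger tiger tiger": presence only vs. actual counts.
print(predict(model, [1, 0, 1, 0], binary=True))   # CS
print(predict(model, [1, 0, 3, 0], binary=False))  # Non-CS
```

With presence only, one CS keyword and one Non-CS keyword look evenly matched and the remaining evidence tips the toy example towards CS; with counts, the three occurrences of `tiger` outweigh the single `algorithm`. That is the kind of information the binary representation discards.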

## **Step-3: Display results of 5 training sentences and test sentences with the corresponding classification results**

#### Step 3.1 **Case 1: Display results of 5 training sentences and test sentences with the corresponding classification results**

In [303]:
# 5 training sentences

In [304]:
pd.set_option('display.max_colwidth', None)

In [305]:
y_pred_case1_train = gnb_case1.predict(case_1_X_train_df.to_numpy())

In [306]:
train_df['case1_predicted'] = y_pred_case1_train

In [350]:
# map the numeric labels back to their names for display
train_df['label'] = train_df['label'].replace({0: 'Non-CS', 1: 'CS'})
train_df['case1_predicted'] = train_df['case1_predicted'].replace({0: 'Non-CS', 1: 'CS'})

In [351]:
train_df[['text', 'label', 'case1_predicted']].head(5)

Unnamed: 0,text,label,case1_predicted
0,The film Swarah accused is from Italy s lustulkar impartial paris majestic tiger smoking The film Swarah accused is from Italy s lustulkar impartial paris ma,Non-CS,Non-CS
1,The bowler of Colorado is male and has a variation known as superman. The bowler of Colorado is male and has a variatio The bowler of Colorado is male and has a variation known as superman. The bowler of Colorado is male and has a variatio,Non-CS,Non-CS
2,heaven playing Colorado invulnerable utah The Unicode for the United States unk Supercomputer was used with an The Unicode for the United States unk Supercomputer was used with an Cache Machine Thread Program Boolean,CS,Non-CS
3,The two years it was operated by the Network Nano Cache were The two years it was operated by the Network Nano Cache were utah drinking spiderman lonely batsman,CS,Non-CS
4,California umpire academe smoking male Supercomputer Drone is part of the Emacs system which has an Exception Supercomputer Drone is part of the Emacs system wh Time Variable Nest Analog Disk,CS,CS


In [308]:
# 5 test sentences

In [309]:
test_df['case1_predicted'] = y_pred_case1

In [352]:
# map the numeric labels back to their names for display
test_df['label'] = test_df['label'].replace({0: 'Non-CS', 1: 'CS'})
test_df['case1_predicted'] = test_df['case1_predicted'].replace({0: 'Non-CS', 1: 'CS'})

In [357]:
test_df[['text', 'label', 'case1_predicted']].sample(5)

Unnamed: 0,text,label,case1_predicted
12,Kilobyte File Thread drive Comment Shaksphere has a lackluster of moonbeam which is the same as in accused elephant invulnerable noiseless tendulkar drinking spiderman womanmadcap umpire Italy,Non-CS,Non-CS
45,male female heaven france lackluster The footballer William s type of drinking is a bit like the lackluster Movie Colorado playing minister womanmadcap male female heaven france lackluster The footballer William s type of drinking is a bit like the lackluster Movie Colorado playing,Non-CS,Non-CS
63,The lackluster of Colorado is utah chicken or batsman parliment accused male president batsman sharing Memory algorithm Chip Base,Non-CS,Non-CS
5,Boolean is part of the UNIX system which contains various variations including Dis Boolean is part of the UNIX system which contains various vari Boolean is part of the UNIX system which contains various variations including Dis Boolean is part of the UNIX system which contains various vari,CS,CS
54,Invulnerable lonely and shyly tendulkar is utah Invulnerable lonely and shyly tendulkar is utah HW Drone drive Software Motherboard,Non-CS,CS


#### Step 3.2 **Case 2: Display 5 training sentences and 5 test sentences with their classification results**

In [311]:
y_pred_case2_train = gnb_case2.predict(case_2_X_train_df.to_numpy())

In [312]:
train_df['case2_predicted'] = pd.DataFrame(y_pred_case2_train)

In [358]:
train_df['case2_predicted'].replace([0,1],['Non-CS','CS'],inplace=True)

In [362]:
# 5 train set predictions
train_df.sample(5)

Unnamed: 0,text,label,case1_predicted,case2_predicted
262,The bowler of the dish dancing in england is male. parliment tiger accused Movie impartial The bowler of the dish dancing in england is male. parliment tiger accused Movie impartial,Non-CS,Non-CS,Non-CS
238,The chicken dish william batman is from Italy and one of the ethnic groups in it The chicken dish william batman is from Italy and one of the ethnic groups in it The chicken dish william batman is from Italy and one of the ethnic groups in it The chicken dish william batman is from Italy and one of the ethnic groups in it,Non-CS,Non-CS,Non-CS
88,Academe is an amazing example of a chicken which has umpire as its musical Academe is an amazing example of a chicken which has umpire as its musical Academe is an amazing example of a chicken which has umpire as its musical Academe is an amazing example of a chicken which has umpire as its musical,Non-CS,Non-CS,Non-CS
236,Kilo intensive Cache Error Boot The male version of the film batman elephant is noid by minister . mimic invulnerable lion tendulkar superman chicken shaksphere impartial tendulkar lonely,Non-CS,Non-CS,Non-CS
106,The lustrous character william in England s thaksphere is heaven . The lustrous character william in England s thaksphere is heaven . The lustrous character william in England s thaksphere is heaven . The lustrous character william in England s th,Non-CS,Non-CS,Non-CS


In [None]:
test_df['case2_predicted'] = pd.DataFrame(y_pred_case2)
test_df['case2_predicted'].replace([0,1],['Non-CS','CS'],inplace=True)

In [361]:
test_df.sample(5)

Unnamed: 0,text,label,case1_predicted,case2_predicted
67,Telepresence Computing supports a database for the use of HW and is an example Telepresence Computing supports a database for the use of Telepresence Computing supports a database for the use of HW and is an example Telepresence Computing supports a database for the use of,CS,CS,CS
27,Exception intensive RAM programming Kilobyte Colorado william lackluster invulnerable lonely Sheenmadcap is an example of utah who danced in tendul lonely tendulkar spiderman shaksphere male umpire female france lonely impartial,Non-CS,Non-CS,Non-CS
37,Computing Program Time Chip Clock Clock Computing for Analog has two types of identifiers such as the Two Clock Machine Syntax Comment Abstraction Computing Program Time Chip Clock Clock Computing for Analog has two types of identifiers such as the Two Clock Machine Syntax Comment Abstraction,CS,CS,CS
73,smoking California batman lonely lion Boolean is the motherboard record label of Peta Disk which Boolean is the motherboard record label o Source Peta Assignment Boolean Supercomputer,CS,Non-CS,CS
21,heaven moonbeam paris tendulkar playing Chicken Sachin Batsman whose drinking partner is parliment bat accused academe chicken smoking Colorado heaven moonbeam paris tendulkar playing Chicken Sachin Batsman whose drinking partner is parliment bat accused academe chicken smoking Colorado,Non-CS,Non-CS,Non-CS
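
As a rough way to read a sample like the one above, the sketch below rebuilds only the label columns of those five sampled rows and compares the two cases. The frame name `sample` is illustrative, and the numbers describe these five rows only, not the full test set.

```python
import pandas as pd

# Label columns copied from the five sampled test rows shown above.
sample = pd.DataFrame(
    {
        "label": ["CS", "Non-CS", "CS", "CS", "Non-CS"],
        "case1_predicted": ["CS", "Non-CS", "CS", "Non-CS", "Non-CS"],
        "case2_predicted": ["CS", "Non-CS", "CS", "CS", "Non-CS"],
    },
    index=[67, 27, 37, 73, 21],
)

# Fraction of rows where each case agrees with the true label.
case1_accuracy = (sample["label"] == sample["case1_predicted"]).mean()
case2_accuracy = (sample["label"] == sample["case2_predicted"]).mean()

# Rows where the binary (Case 1) and count (Case 2) models disagree.
disagreements = sample[sample["case1_predicted"] != sample["case2_predicted"]]

print(case1_accuracy, case2_accuracy)
print(disagreements)
```

The same comparisons can be run on the full `test_df` once both prediction columns are populated.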


## **Step - 4:  Display the result of the algorithm when trained only on CS data and tested only on non-CS data**

#### Step 4.1 : **Case 1: Display the result of the algorithm when trained only on CS data and tested only on non-CS data**

In [315]:
cs_data_train = cs_df.copy()
non_cs_data_test = non_cs_df.copy()


In [316]:
case_1_count_vectorizer_all_cs_train = CountVectorizer(vocabulary=keywords_with_indices, lowercase=False)
case_1_count_matrix_all_cs_train = case_1_count_vectorizer_all_cs_train.fit_transform(cs_data_train['text'].tolist())

In [317]:
case_1_train_count_array_all_cs_train = case_1_count_matrix_all_cs_train.toarray()
case_1_X_train_df_all_cs = pd.DataFrame(data=case_1_train_count_array_all_cs_train,columns = case_1_count_vectorizer_all_cs_train.get_feature_names())



In [318]:
case_1_X_train_df_all_cs.head()

Unnamed: 0,UNIX,england,batman,male,Machine,bowler,Operator,Call,impartial,footballmonumental,...,Source,superman,sharing,eating,william,Analog,Telepresence,lonely,spiderman,Base
0,0,0,0,0,0,0,2,0,0,0,...,0,0,0,0,0,2,0,0,0,0
1,0,0,0,1,0,0,0,1,0,0,...,0,0,1,0,1,0,1,0,0,0
2,1,0,0,0,0,0,0,0,1,0,...,0,1,0,0,0,0,0,0,0,0
3,0,0,0,0,2,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [319]:
cs_data_train[:5]['text'].tolist()

['Clock Kilo  Kilobyte Abstraction Operator Analog I/O is assigned to the machine that uses a Kilo intensive IT Boot Abstraction Chip Exception Clock Kilo  Kilobyte Abstraction Operator Analog I/O is assigned to the machine that uses a Kilo intensive IT Boot Abstra',
 'IT Call sharing Peta  Telepresence The Ram s base is in the Township of Syntax HW. Motherboard Unicode Nano  Peta  IT william majestic Johnny paris male',
 'impartial superman heaven lackluster smoking Syntax is a two way video game with an algorithm of HW. Syntax is a two way video game with UNIX Supercomputer Clock Assignment Chip',
 'Abstraction Boot Boolean Supercomputer Southbridge I/O is the source of a program called Machine Bowlean  that Time RAM Peta  Drone Assignment Abstraction Boot Boolean Supercomputer Southbridge I/O is the source of a program called Machine Bowlean  that Time RAM Peta  Drone Assignment',
 'The callsign Kilobyte  with the Unicode of  Shame   The callsign Kilobyte  with the Unicode of  Shame

In [320]:
case_1_X_train_df_all_cs.max().describe()

count    100.000000
mean       2.160000
std        1.475182
min        0.000000
25%        1.000000
50%        1.000000
75%        4.000000
max        6.000000
dtype: float64

In [321]:
case_1_X_train_df_all_cs.clip(upper=1, inplace=True)

In [322]:
case_1_X_train_df_all_cs.max().describe() # after clipping counts to an upper bound of 1, each feature is binary: keyword present (1) or absent (0)

count    100.000000
mean       0.960000
std        0.196946
min        0.000000
25%        1.000000
50%        1.000000
75%        1.000000
max        1.000000
dtype: float64
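
As an alternative to clipping after the fact, `CountVectorizer` accepts `binary=True`, which sets every non-zero count to 1. The sketch below uses a tiny hypothetical vocabulary and two made-up documents (`vocab` and `docs` are illustrative, not the notebook's `keywords_with_indices`) to check that both routes produce the same presence/absence matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical three-keyword vocabulary and two toy documents.
vocab = {"Clock": 0, "Kilo": 1, "bowler": 2}
docs = ["Clock Kilo Clock Clock", "bowler bowler Kilo"]

# Route 1: raw counts, then clip to an upper bound of 1.
counts = CountVectorizer(vocabulary=vocab, lowercase=False).fit_transform(docs).toarray()
clipped = np.clip(counts, 0, 1)

# Route 2: let the vectorizer emit presence/absence directly.
binary_vectorizer = CountVectorizer(vocabulary=vocab, lowercase=False, binary=True)
binary = binary_vectorizer.fit_transform(docs).toarray()

print(counts)
print(binary)
```

Because `binary=True` also applies when `transform` is called on new text, this route binarizes the training and test matrices in the same way without a separate clipping step for each.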

In [323]:
# Reuse the CountVectorizer fitted on the training set to vectorize the test set
case1_count_matrix_on_test_set_all_noncs = case_1_count_vectorizer_all_cs_train.transform(non_cs_data_test['text'].tolist())
case_1_test_count_array_all_noncs = case1_count_matrix_on_test_set_all_noncs.toarray()
case_1_X_test_df_all_non_cs = pd.DataFrame(data=case_1_test_count_array_all_noncs,columns = case_1_count_vectorizer_all_cs_train.get_feature_names())



In [324]:
case_1_X_test_df_all_non_cs.head()

Unnamed: 0,UNIX,england,batman,male,Machine,bowler,Operator,Call,impartial,footballmonumental,...,Source,superman,sharing,eating,william,Analog,Telepresence,lonely,spiderman,Base
0,0,0,0,0,0,0,0,0,0,0,...,0,4,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0
2,0,0,0,0,0,4,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,2,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,4,0,0,0,4,...,0,0,0,0,0,0,0,0,0,0


In [325]:
non_cs_data_test[:5]['text'].tolist()

['The moonbeam is superman  while the lustrous and unrelated to it  The moonbeam is superman  while the lustrous and unrelated to it  The moonbeam is superman  while the lustrous and unrelated to it  The moonbeam is superman  while the lustrous ',
 'Invulnerable  lonely and shyly tendulkar is utah  Invulnerable  lonely and shyly tendulkar is utah  HW Drone drive Software Motherboard',
 'The bowler of the word  lustrous  is umpire and sheadcap The bowler of the word  lustrous  is umpire and sheadcap The bowler of the word  lustrous  is umpire and sheadcap The bowler of the word  lustrous  is',
 'The bowler tiger  which has a tint of white and is otherwise known as utah umpire playing lion California The bowler tiger  which has a tint of white and is otherwise known as utah umpire playing lion California',
 'Sachin s footballmonumental has a lackluster in bowler  which is Sachin s footballmonumental has a lackluster in bowler  which is Sachin s footballmonumental has a lackluster in bowl

In [326]:
cs_data_train['label'].value_counts()

CS    200
Name: label, dtype: int64

In [327]:
cs_data_train['label'].replace(['Non-CS','CS'],[0,1],inplace=True)
cs_data_train['label'].value_counts()

1    200
Name: label, dtype: int64

In [328]:
non_cs_data_test['label'].value_counts()

Non-CS    200
Name: label, dtype: int64

In [329]:
non_cs_data_test['label'].replace(['Non-CS','CS'],[0,1],inplace=True)
non_cs_data_test['label'].value_counts()

0    200
Name: label, dtype: int64

In [330]:
YTRAIN_all_cs = cs_data_train['label'].to_numpy()
YTEST_all_non_cs = non_cs_data_test['label'].to_numpy()

In [331]:
gnb_case1_all_cs_train = GaussianNB()
y_pred_case1_all_noncs_test = gnb_case1_all_cs_train.fit(case_1_X_train_df_all_cs.to_numpy(), YTRAIN_all_cs).predict(case_1_X_test_df_all_non_cs.to_numpy())

In [332]:
print("Case1: Number of mislabeled points out of a total %d points : %d"
      % (case_1_X_test_df_all_non_cs.shape[0], (YTEST_all_non_cs != y_pred_case1_all_noncs_test).sum()))

Case1: Number of mislabeled points out of a total 200 points : 200


In [333]:
print(skmetrics.classification_report(YTEST_all_non_cs, y_pred_case1_all_noncs_test))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00     200.0
           1       0.00      0.00      0.00       0.0

    accuracy                           0.00     200.0
   macro avg       0.00      0.00      0.00     200.0
weighted avg       0.00      0.00      0.00     200.0
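
When a class has no true or no predicted samples, some of the metrics in this report are undefined and scikit-learn emits a warning for each. A minimal sketch of one way to make that explicit, using toy label arrays rather than the notebook's data: pass `zero_division=0` to state the fallback value, and look at the confusion matrix to see where the predictions went.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Toy setup mirroring the experiment: every true label is 0 (Non-CS),
# every prediction is 1 (CS).
y_true = np.zeros(4, dtype=int)
y_pred = np.ones(4, dtype=int)

# zero_division=0 reports undefined precision/recall as 0 without warnings.
report = classification_report(
    y_true, y_pred, labels=[0, 1], zero_division=0, output_dict=True
)

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_true, y_pred, labels=[0, 1])

print(report["0"])
print(matrix)
```

The confusion matrix puts every sample in the "true Non-CS, predicted CS" cell, which is easier to interpret than a table of zeros.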





#### Step 4.2 **Case 2: Display the result of the algorithm when trained only on CS data and tested only on non-CS data**

In [334]:
case_2_count_vectorizer_all_cs_train = CountVectorizer(vocabulary=keywords_with_indices, lowercase=False)
case_2_count_matrix_all_cs_train = case_2_count_vectorizer_all_cs_train.fit_transform(cs_data_train['text'].tolist())
case_2_train_count_array_all_cs_train = case_2_count_matrix_all_cs_train.toarray()
case_2_X_train_df_all_cs = pd.DataFrame(data=case_2_train_count_array_all_cs_train,columns = case_2_count_vectorizer_all_cs_train.get_feature_names())
case_2_X_train_df_all_cs.head()



Unnamed: 0,UNIX,england,batman,male,Machine,bowler,Operator,Call,impartial,footballmonumental,...,Source,superman,sharing,eating,william,Analog,Telepresence,lonely,spiderman,Base
0,0,0,0,0,0,0,2,0,0,0,...,0,0,0,0,0,2,0,0,0,0
1,0,0,0,1,0,0,0,1,0,0,...,0,0,1,0,1,0,1,0,0,0
2,1,0,0,0,0,0,0,0,1,0,...,0,1,0,0,0,0,0,0,0,0
3,0,0,0,0,2,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [335]:
cs_data_train[:5]['text'].tolist()

['Clock Kilo  Kilobyte Abstraction Operator Analog I/O is assigned to the machine that uses a Kilo intensive IT Boot Abstraction Chip Exception Clock Kilo  Kilobyte Abstraction Operator Analog I/O is assigned to the machine that uses a Kilo intensive IT Boot Abstra',
 'IT Call sharing Peta  Telepresence The Ram s base is in the Township of Syntax HW. Motherboard Unicode Nano  Peta  IT william majestic Johnny paris male',
 'impartial superman heaven lackluster smoking Syntax is a two way video game with an algorithm of HW. Syntax is a two way video game with UNIX Supercomputer Clock Assignment Chip',
 'Abstraction Boot Boolean Supercomputer Southbridge I/O is the source of a program called Machine Bowlean  that Time RAM Peta  Drone Assignment Abstraction Boot Boolean Supercomputer Southbridge I/O is the source of a program called Machine Bowlean  that Time RAM Peta  Drone Assignment',
 'The callsign Kilobyte  with the Unicode of  Shame   The callsign Kilobyte  with the Unicode of  Shame

In [336]:
case_2_X_train_df_all_cs.max().describe() # note that we do not clip to 1 for binary existence, since Case 2 uses the raw counts

count    100.000000
mean       2.160000
std        1.475182
min        0.000000
25%        1.000000
50%        1.000000
75%        4.000000
max        6.000000
dtype: float64

In [337]:
# Reuse the CountVectorizer fitted on the training set to vectorize the test set
case2_count_matrix_on_test_set_all_noncs = case_2_count_vectorizer_all_cs_train.transform(non_cs_data_test['text'].tolist())
case_2_test_count_array_all_noncs = case2_count_matrix_on_test_set_all_noncs.toarray()
case_2_X_test_df_all_non_cs = pd.DataFrame(data=case_2_test_count_array_all_noncs,columns = case_2_count_vectorizer_all_cs_train.get_feature_names())
case_2_X_test_df_all_non_cs.head()



Unnamed: 0,UNIX,england,batman,male,Machine,bowler,Operator,Call,impartial,footballmonumental,...,Source,superman,sharing,eating,william,Analog,Telepresence,lonely,spiderman,Base
0,0,0,0,0,0,0,0,0,0,0,...,0,4,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0
2,0,0,0,0,0,4,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,2,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,4,0,0,0,4,...,0,0,0,0,0,0,0,0,0,0


In [338]:
non_cs_data_test[:5]['text'].tolist()

['The moonbeam is superman  while the lustrous and unrelated to it  The moonbeam is superman  while the lustrous and unrelated to it  The moonbeam is superman  while the lustrous and unrelated to it  The moonbeam is superman  while the lustrous ',
 'Invulnerable  lonely and shyly tendulkar is utah  Invulnerable  lonely and shyly tendulkar is utah  HW Drone drive Software Motherboard',
 'The bowler of the word  lustrous  is umpire and sheadcap The bowler of the word  lustrous  is umpire and sheadcap The bowler of the word  lustrous  is umpire and sheadcap The bowler of the word  lustrous  is',
 'The bowler tiger  which has a tint of white and is otherwise known as utah umpire playing lion California The bowler tiger  which has a tint of white and is otherwise known as utah umpire playing lion California',
 'Sachin s footballmonumental has a lackluster in bowler  which is Sachin s footballmonumental has a lackluster in bowler  which is Sachin s footballmonumental has a lackluster in bowl

In [339]:
gnb_case2_all_cs_train = GaussianNB()
# Fit the Case 2 model (not the Case 1 model) on the raw-count features
y_pred_case2_all_noncs_test = gnb_case2_all_cs_train.fit(case_2_X_train_df_all_cs.to_numpy(), YTRAIN_all_cs).predict(case_2_X_test_df_all_non_cs.to_numpy())

In [340]:
print("Case2: Number of mislabeled points out of a total %d points : %d"
      % (case_2_X_test_df_all_non_cs.shape[0], (YTEST_all_non_cs != y_pred_case2_all_noncs_test).sum()))

Case2: Number of mislabeled points out of a total 200 points : 200


In [341]:
print(skmetrics.classification_report(YTEST_all_non_cs, y_pred_case2_all_noncs_test))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00     200.0
           1       0.00      0.00      0.00       0.0

    accuracy                           0.00     200.0
   macro avg       0.00      0.00      0.00     200.0
weighted avg       0.00      0.00      0.00     200.0





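**Insight: In both Case 1 and Case 2, when the model is trained only on CS data and tested only on non-CS data, every prediction is wrong. This is expected: the training set contains a single class (CS), so the classifier has never seen a non-CS example and can only ever predict CS. The failure comes from the one-class training data rather than from overfitting.**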
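
The behaviour described in this insight can be reproduced on a small scale. The sketch below is a minimal illustration with made-up count vectors (the arrays and variable names are hypothetical, not taken from the notebook): a `GaussianNB` model fitted on a single class learns only that class, so it assigns it to every input, whatever the features look like.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy keyword-count vectors, all labelled 1 (CS).
X_train = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [2.0, 0.0, 0.0]])
y_train = np.array([1, 1, 1])

# Test vectors that look nothing like the training rows.
X_test = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 5.0], [0.0, 3.0, 0.0]])

model = GaussianNB().fit(X_train, y_train)
single_class_predictions = model.predict(X_test)

print(model.classes_)            # the only class the model knows about
print(single_class_predictions)  # every row receives that class
```

Since the only candidate label is CS, a test set made up entirely of non-CS sentences is misclassified in full, which matches the 200 out of 200 errors reported for both cases.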