# 20 Newsgroups Data

## Problem Description

This data set consists of roughly 20,000 messages taken from 20 newsgroups. Approximately 4% of the articles are crossposted. The articles are typical postings and thus have headers including subject lines, signature files, and quoted portions of other articles.
The task is to classify each message by its newsgroup, given that:
 - Each newsgroup directory in the bundle represents a single newsgroup.
 - Each file in a directory is the text of one document that was posted to that newsgroup.

## Importing the relevant libraries

In [1]:
# Loading Packages
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from sklearn.datasets import fetch_20newsgroups
import pandas as pd

## Loading Data

In [2]:
%%time

import os

docs=[]
labels=[]
label_index={}

PATH=os.getcwd()

text_data_dir=os.path.join(PATH,'20_newsgroups')
for name in os.listdir(text_data_dir):
    path=os.path.join(text_data_dir,name)
    if os.path.isdir(path):
        label_id=len(label_index)
        label_index[label_id]=name
        for fname in sorted(os.listdir(path)):
            fpath=os.path.join(path,fname)
            # ISO-8859-1 maps every byte to a character, so decoding never fails
            with open(fpath,encoding="ISO-8859-1") as f:
                docs.append(f.read())
            labels.append(label_id)

print('Found %s docs.' %len(docs))
        


Found 19997 docs.
Wall time: 3.31 s
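
The loading loop above can also be packaged as a reusable function. The following is a minimal sketch, not the notebook's own code: `load_newsgroup_dir` is a hypothetical helper name, it assumes the same one-directory-per-newsgroup layout, and it sorts the directory names so that label ids do not depend on the order `os.listdir` happens to return. The demo builds a tiny temporary directory rather than reading the real data.

```python
import tempfile
from pathlib import Path

def load_newsgroup_dir(root, encoding="ISO-8859-1"):
    """Read one document per file from each subdirectory of `root` (hypothetical helper)."""
    docs, labels, label_index = [], [], {}
    # Sorting the directory names makes the label ids reproducible across machines
    for group_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        label_id = len(label_index)
        label_index[label_id] = group_dir.name
        for fpath in sorted(group_dir.iterdir()):
            if fpath.is_file():
                docs.append(fpath.read_text(encoding=encoding))
                labels.append(label_id)
    return docs, labels, label_index

# Tiny made-up corpus, only to exercise the function
with tempfile.TemporaryDirectory() as tmp:
    for group, body in [("sci.space", "orbit"), ("alt.atheism", "faq")]:
        (Path(tmp) / group).mkdir()
        (Path(tmp) / group / "1").write_text(body, encoding="ISO-8859-1")
    demo_docs, demo_labels, demo_index = load_newsgroup_dir(tmp)

print(demo_index)
```

With the real data, the call would presumably be `load_newsgroup_dir(text_data_dir)`.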


In [3]:
label_index          # Getting labels

{0: 'alt.atheism',
 1: 'comp.graphics',
 2: 'comp.os.ms-windows.misc',
 3: 'comp.sys.ibm.pc.hardware',
 4: 'comp.sys.mac.hardware',
 5: 'comp.windows.x',
 6: 'misc.forsale',
 7: 'rec.autos',
 8: 'rec.motorcycles',
 9: 'rec.sport.baseball',
 10: 'rec.sport.hockey',
 11: 'sci.crypt',
 12: 'sci.electronics',
 13: 'sci.med',
 14: 'sci.space',
 15: 'soc.religion.christian',
 16: 'talk.politics.guns',
 17: 'talk.politics.mideast',
 18: 'talk.politics.misc',
 19: 'talk.religion.misc'}

In [4]:
type(docs)

list

In [5]:
data = pd.DataFrame(docs)   # Converting to DataFrame from list

In [6]:
data.head()

| | 0 |
|---|---|
| 0 | Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:49... |
| 1 | Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51... |
| 2 | Newsgroups: alt.atheism\nPath: cantaloupe.srv.... |
| 3 | Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51... |
| 4 | Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51... |


In [7]:
data['target'] = labels

In [8]:
data.columns=['text','target']

In [9]:
data.shape

(19997, 2)

In [10]:
#Duplicating the original text extracted before proceeding with preprocessing steps

import copy
print(type(data['text']))
original_data = copy.deepcopy(data)
print(data.keys())
print(original_data.keys())

<class 'pandas.core.series.Series'>
Index(['text', 'target'], dtype='object')
Index(['text', 'target'], dtype='object')


## Basic cleaning of text

### LowerCase all text

In [11]:
data['text'] = [text.strip().lower() for text in data['text']]
data['text'][:10]  #Checking first 10 rows

0    xref: cantaloupe.srv.cs.cmu.edu alt.atheism:49...
1    xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51...
2    newsgroups: alt.atheism\npath: cantaloupe.srv....
3    xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51...
4    xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51...
5    newsgroups: alt.atheism\npath: cantaloupe.srv....
6    path: cantaloupe.srv.cs.cmu.edu!crabapple.srv....
7    newsgroups: alt.atheism\npath: cantaloupe.srv....
8    path: cantaloupe.srv.cs.cmu.edu!crabapple.srv....
9    path: cantaloupe.srv.cs.cmu.edu!crabapple.srv....
Name: text, dtype: object

## Defining the functions to perform basic steps like 

- **expanding contractions**
 
- **remove accented characters**

- **scrub words**

In [12]:
type(data['text'])

pandas.core.series.Series

###  Handling contractions 

In [13]:
contractions = {
"ain't": "is not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he'll've": "he will have",
"he's": "he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how is",
"I'd": "I would",
"I'd've": "I would have",
"I'll": "I will",
"I'll've": "I will have",
"I'm": "I am",
"I've": "I have",
"i'd": "i would",
"i'd've": "i would have",
"i'll": "i will",
"i'll've": "i will have",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'd've": "it would have",
"it'll": "it will",
"it'll've": "it will have",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she would",
"she'd've": "she would have",
"she'll": "she will",
"she'll've": "she will have",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so as",
"that'd": "that would",
"that'd've": "that would have",
"that's": "that is",
"there'd": "there would",
"there'd've": "there would have",
"there's": "there is",
"they'd": "they would",
"they'd've": "they would have",
"they'll": "they will",
"they'll've": "they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we would",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what will",
"what'll've": "what will have",
"what're": "what are",
"what's": "what is",
"what've": "what have",
"when's": "when is",
"when've": "when have",
"where'd": "where did",
"where's": "where is",
"where've": "where have",
"who'll": "who will",
"who'll've": "who will have",
"who's": "who is",
"who've": "who have",
"why's": "why is",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you would",
"you'd've": "you would have",
"you'll": "you will",
"you'll've": "you will have",
"you're": "you are",
"you've": "you have"
}

In [14]:
def expand_contractions(text):
    for word in text.split():
        if word.lower() in contractions:
            text = text.replace(word, contractions[word.lower()])
    return text

In [15]:
import re
data['text'] = [expand_contractions(re.sub('’', "'", text)) for text in data['text']]
data['text'][1]

'xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51060 alt.atheism.moderated:727 news.answers:7300 alt.answers:155\npath: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!howland.reston.ans.net!agate!netsys!ibmpcug!mantis!mathew\nfrom: mathew <mathew@mantis.co.uk>\nnewsgroups: alt.atheism,alt.atheism.moderated,news.answers,alt.answers\nsubject: alt.atheism faq: introduction to atheism\nsummary: please read this file before posting to alt.atheism\nkeywords: faq, atheism\nmessage-id: <19930405122245@mantis.co.uk>\ndate: mon, 5 apr 1993 12:22:45 gmt\nexpires: thu, 6 may 1993 12:22:45 gmt\nfollowup-to: alt.atheism\ndistribution: world\norganization: mantis consultants, cambridge. uk.\napproved: news-answers-request@mit.edu\nsupersedes: <19930308134439@mantis.co.uk>\nlines: 646\n\narchive-name: atheism/introduction\nalt-atheism-archive-name: introduction\nlast-modified: 5 april 1993\nversion: 1.2\n\n-----begin pgp signed message-----\n\n                  
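
Because `expand_contractions` splits on whitespace and then calls `str.replace`, a contraction followed by punctuation (for example `can't,`) is not found, and a replacement can also hit the same characters inside a longer word. A minimal sketch of a regex-based alternative follows. It is an illustration rather than the notebook's method: `sample_contractions` is a hypothetical subset of the full dictionary, the function names are made up, and the lookup assumes lowercase keys.

```python
import re

# Hypothetical subset of the full `contractions` dictionary, for illustration only
sample_contractions = {
    "can't": "cannot",
    "can't've": "cannot have",
    "don't": "do not",
    "it's": "it is",
}

def build_contraction_pattern(mapping):
    # Longest keys first so "can't've" is tried before "can't"
    keys = sorted(mapping, key=len, reverse=True)
    alternation = "|".join(re.escape(k) for k in keys)
    # The lookarounds reject matches that sit inside a longer word or contraction
    return re.compile(r"(?<![\w'])(?:" + alternation + r")(?![\w'])", re.IGNORECASE)

def expand_contractions_regex(text, mapping=sample_contractions):
    pattern = build_contraction_pattern(mapping)
    # Keys are assumed to be lowercase, so the matched text is lowercased for lookup
    return pattern.sub(lambda m: mapping[m.group(0).lower()], text)

print(expand_contractions_regex("i can't, but it's fine"))
```

Compiling the pattern once outside the function would be the natural next step when applying it to all documents.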

## Invoking the remove_accented_chars() function

In [16]:
data['text'][2]

'newsgroups: alt.atheism\npath: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!howland.reston.ans.net!usc!sdd.hp.com!nigel.msen.com!yale.edu!ira.uka.de!news.dfn.de!tubsibr!dbstu1.rz.tu-bs.de!i3150101\nfrom: i3150101@dbstu1.rz.tu-bs.de (benedikt rosenau)\nsubject: re: gospel dating\nmessage-id: <16ba711b3a.i3150101@dbstu1.rz.tu-bs.de>\nsender: postnntp@ibr.cs.tu-bs.de (mr. nntp inews entry)\norganization: technical university braunschweig, germany\nreferences: <16ba1e197.i3150101@dbstu1.rz.tu-bs.de> <65974@mimsy.umd.edu>\ndate: mon, 5 apr 1993 19:08:25 gmt\nlines: 93\n\nin article <65974@mimsy.umd.edu>\nmangoe@cs.umd.edu (charley wingate) writes:\n \n>>well, john has a quite different, not necessarily more elaborated theology.\n>>there is some evidence that he must have known luke, and that the content\n>>of q was known to him, but not in a \'canonized\' form.\n>\n>this is a new argument to me.  could you elaborate a little?\n>\n \nthe argument 

In [17]:
import unicodedata
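def remove_accented_chars(text):
    # NFKD splits accented characters into a base character plus combining marks;
    # encoding to ASCII with 'ignore' then drops the marks and any other non-ASCII character.
    # See https://docs.python.org/3/library/unicodedata.html
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text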

In [18]:
data['text'] = [remove_accented_chars(text) for text in data['text']]
data['text'][2]

'newsgroups: alt.atheism\npath: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!howland.reston.ans.net!usc!sdd.hp.com!nigel.msen.com!yale.edu!ira.uka.de!news.dfn.de!tubsibr!dbstu1.rz.tu-bs.de!i3150101\nfrom: i3150101@dbstu1.rz.tu-bs.de (benedikt rosenau)\nsubject: re: gospel dating\nmessage-id: <16ba711b3a.i3150101@dbstu1.rz.tu-bs.de>\nsender: postnntp@ibr.cs.tu-bs.de (mr. nntp inews entry)\norganization: technical university braunschweig, germany\nreferences: <16ba1e197.i3150101@dbstu1.rz.tu-bs.de> <65974@mimsy.umd.edu>\ndate: mon, 5 apr 1993 19:08:25 gmt\nlines: 93\n\nin article <65974@mimsy.umd.edu>\nmangoe@cs.umd.edu (charley wingate) writes:\n \n>>well, john has a quite different, not necessarily more elaborated theology.\n>>there is some evidence that he must have known luke, and that the content\n>>of q was known to him, but not in a \'canonized\' form.\n>\n>this is a new argument to me.  could you elaborate a little?\n>\n \nthe argument 
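
The sample document above contains no accented characters, so its output is unchanged. A small sketch on made-up strings shows what the normalization step does; `strip_accents` is a hypothetical name for the same technique as `remove_accented_chars`.

```python
import unicodedata

def strip_accents(text):
    # Decompose accented characters into base character + combining mark
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop everything that is not plain ASCII, including the combining marks
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(strip_accents("café naïve"))
print(strip_accents("straße"))
```

Characters that have no ASCII decomposition, such as `ß`, are removed outright rather than transliterated, which may matter for non-English text.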

## Invoking various scrub functions

In [19]:
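def scrub_words(text):
    # Replace non-breaking spaces (\xa0) with regular spaces
    text = re.sub('\xa0', ' ', text)
    
    # Replace every non-word character and every digit with a space
    text = re.sub(r"(\W|\d)", ' ', text)
    
    # Remove a newline together with the word that follows it, up to the next whitespace
    # (newlines were already replaced by the previous step, so this no longer matches)
    text = re.sub(r'\n(\w*?)[\s]', '', text)
    
    # Remove HTML markup (angle brackets were also replaced above, so this no longer matches)
    text = re.sub(r"<.*?>", ' ', text)
    
    # Collapse runs of whitespace into a single space
    text = re.sub(r"\s+", ' ', text)
    return text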

In [20]:
data['text'] = [scrub_words(text) for text in data['text']]
data['text'][1]

'xref cantaloupe srv cs cmu edu alt atheism alt atheism moderated news answers alt answers path cantaloupe srv cs cmu edu crabapple srv cs cmu edu fs ece cmu edu europa eng gtefsd com howland reston ans net agate netsys ibmpcug mantis mathew from mathew mathew mantis co uk newsgroups alt atheism alt atheism moderated news answers alt answers subject alt atheism faq introduction to atheism summary please read this file before posting to alt atheism keywords faq atheism message id mantis co uk date mon apr gmt expires thu may gmt followup to alt atheism distribution world organization mantis consultants cambridge uk approved news answers request mit edu supersedes mantis co uk lines archive name atheism introduction alt atheism archive name introduction last modified april version begin pgp signed message an introduction to atheism by mathew mathew mantis co uk this article attempts to provide a general introduction to atheism whilst i have tried to be as neutral as possible regarding co
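
In `scrub_words`, the non-word replacement runs before the HTML step, so angle brackets are already gone by the time the markup pattern is applied. If stripping markup is actually wanted, one option is to run that step first. The sketch below is an assumption about the intent, not the notebook's behaviour: `scrub_words_tags_first` is a hypothetical name, and it also strips leading and trailing spaces, which the original does not.

```python
import re

def scrub_words_tags_first(text):
    # Non-breaking spaces become regular spaces
    text = text.replace("\xa0", " ")
    # Remove anything between angle brackets while the brackets still exist
    text = re.sub(r"<.*?>", " ", text)
    # Replace non-word characters and digits with spaces
    text = re.sub(r"[\W\d]", " ", text)
    # Collapse whitespace and trim the ends
    return re.sub(r"\s+", " ", text).strip()

print(scrub_words_tags_first("<b>hello</b> world 42!"))
```

One caveat for this data set: newsgroup headers wrap e-mail addresses and message ids in angle brackets, so this ordering removes them entirely instead of splitting them into words.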

## Checking the integrity of the data after initial preprocessing steps

In [21]:
print("Data Type: ",type(original_data['text']))
print("Data Type: ",type(data['text']))

print("Length of data: ",len(original_data['text']))
print("Length of data: ",len(data['text']))

print("Original data: \n",original_data['text'][0])
print("\n\n**************************************************************************\n\n")
print("Clean data: \n",data['text'][0])

Data Type:  <class 'pandas.core.series.Series'>
Data Type:  <class 'pandas.core.series.Series'>
Length of data:  19997
Length of data:  19997
Original data: 
 Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:49960 alt.atheism.moderated:713 news.answers:7054 alt.answers:126
Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!bb3.andrew.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu!magnus.acs.ohio-state.edu!usenet.ins.cwru.edu!agate!spool.mu.edu!uunet!pipex!ibmpcug!mantis!mathew
From: mathew <mathew@mantis.co.uk>
Newsgroups: alt.atheism,alt.atheism.moderated,news.answers,alt.answers
Subject: Alt.Atheism FAQ: Atheist Resources
Summary: Books, addresses, music -- anything related to atheism
Keywords: FAQ, atheism, books, music, fiction, addresses, contacts
Message-ID: <19930329115719@mantis.co.uk>
Date: Mon, 29 Mar 1993 11:57:19 GMT
Expires: Thu, 29 Apr 1993 11:57:19 GMT
Followup-To: alt.atheism
Distribution: world
Organization: Mantis Consultants, Cambridge. UK.
Approved: news-answers-reque

In [22]:
print("Original data: \n",original_data['text'][1])
print("\n\n**************************************************************************\n\n")
print("Clean data: \n",data['text'][1])

Original data: 
 Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51060 alt.atheism.moderated:727 news.answers:7300 alt.answers:155
Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!howland.reston.ans.net!agate!netsys!ibmpcug!mantis!mathew
From: mathew <mathew@mantis.co.uk>
Newsgroups: alt.atheism,alt.atheism.moderated,news.answers,alt.answers
Subject: Alt.Atheism FAQ: Introduction to Atheism
Summary: Please read this file before posting to alt.atheism
Keywords: FAQ, atheism
Message-ID: <19930405122245@mantis.co.uk>
Date: Mon, 5 Apr 1993 12:22:45 GMT
Expires: Thu, 6 May 1993 12:22:45 GMT
Followup-To: alt.atheism
Distribution: world
Organization: Mantis Consultants, Cambridge. UK.
Approved: news-answers-request@mit.edu
Supersedes: <19930308134439@mantis.co.uk>
Lines: 646

Archive-name: atheism/introduction
Alt-atheism-archive-name: introduction
Last-modified: 5 April 1993
Version: 1.2

-----BEGIN PGP SIGNED MESSAGE-----

                          

# Text Preprocessing

#### Adding new column "word_count" which specifies the number of tokens in each document

In [23]:
data['word_count'] = [len(text.split(' ')) for text in data['text']]
pd.DataFrame(data['word_count']).describe()

| | word_count |
|---|---|
| count | 19997.0 |
| mean | 369.937991 |
| std | 724.689806 |
| min | 48.0 |
| 25% | 178.0 |
| 50% | 252.0 |
| 75% | 372.0 |
| max | 39436.0 |
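
Note that `text.split(' ')` counts empty strings when a text starts or ends with a space, which can happen after scrubbing. A short sketch of the difference, using a made-up string and a hypothetical `count_tokens` helper:

```python
def count_tokens(text):
    # str.split() with no argument splits on whitespace runs and drops empty strings
    return len(text.split())

sample = " a b "
print(len(sample.split(" ")))  # includes the empty strings at both ends
print(count_tokens(sample))
```

The difference is at most a couple of tokens per document, so it is unlikely to change the quartile filtering much, but `split()` gives the more literal word count.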


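### Converting the data to a DataFrame

`data` is already a DataFrame at this point, so wrapping it in `pd.DataFrame` below leaves its contents unchanged. We work with it as a DataFrame because pandas provides more readable subsetting options.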

In [24]:
data.keys()

Index(['text', 'target', 'word_count'], dtype='object')

In [25]:
data.head()

| | text | target | word_count |
|---|---|---|---|
| 0 | xref cantaloupe srv cs cmu edu alt atheism alt... | 0 | 1772 |
| 1 | xref cantaloupe srv cs cmu edu alt atheism alt... | 0 | 5425 |
| 2 | newsgroups alt atheism path cantaloupe srv cs ... | 0 | 806 |
| 3 | xref cantaloupe srv cs cmu edu alt atheism alt... | 0 | 325 |
| 4 | xref cantaloupe srv cs cmu edu alt atheism soc... | 0 | 206 |


In [26]:
news_df = pd.DataFrame(data)
print("Shape: ",news_df.shape)
news_df.head(5)

Shape:  (19997, 3)


| | text | target | word_count |
|---|---|---|---|
| 0 | xref cantaloupe srv cs cmu edu alt atheism alt... | 0 | 1772 |
| 1 | xref cantaloupe srv cs cmu edu alt atheism alt... | 0 | 5425 |
| 2 | newsgroups alt atheism path cantaloupe srv cs ... | 0 | 806 |
| 3 | xref cantaloupe srv cs cmu edu alt atheism alt... | 0 | 325 |
| 4 | xref cantaloupe srv cs cmu edu alt atheism soc... | 0 | 206 |


### Removing all posts with a word_count at or below the first quartile (25%)

In [27]:
## Getting the first quartile value
q1 = np.percentile(news_df.word_count,25)
print(f"The first quartile value of the word_count attribute is {q1}")

The first quartile value of the word_count attribute is 178.0


In [28]:
# Keep only posts strictly longer than the first quartile
news_df = news_df[news_df['word_count'] > q1]
print(f"The shape of the trimmed posts dataframe is {news_df.shape}")

The shape of the trimmed posts dataframe is (14996, 3)
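
The filter keeps rows strictly above the first quartile, so rows equal to it are dropped as well; that is why 5001 of the 19997 documents are removed, slightly more than 25%. The same logic can be sketched with the standard library on a made-up list of counts. `first_quartile` is a hypothetical helper, and `method="inclusive"` should give the same linear interpolation as the default of `np.percentile`.

```python
import statistics

def first_quartile(values):
    # "inclusive" treats the minimum as the 0th and the maximum as the 100th percentile
    return statistics.quantiles(values, n=4, method="inclusive")[0]

word_counts = [1, 2, 3, 4, 5]  # hypothetical word counts
q1_demo = first_quartile(word_counts)

# Strict comparison: the value equal to the quartile is removed too
kept = [c for c in word_counts if c > q1_demo]
print(q1_demo, kept)
```

Using `>=` instead would keep the boundary documents; which one is preferable depends on how aggressively short posts should be trimmed.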


#### Converting dataframe back to dictionary

In [29]:
data = news_df.reset_index().to_dict(orient='list')
print(f"The keys in the dictionary are {data.keys()}")

The keys in the dictionary are dict_keys(['index', 'text', 'target', 'word_count'])


In [30]:
print(data['text'][5])

newsgroups alt atheism path cantaloupe srv cs cmu edu crabapple srv cs cmu edu fs ece cmu edu europa eng gtefsd com howland reston ans net usc sdd hp com nigel msen com yale edu ira uka de news dfn de tubsibr dbstu rz tu bs de i from i dbstu rz tu bs de benedikt rosenau subject re a visit from the jehovah s witnesses message id ba ef i dbstu rz tu bs de sender postnntp ibr cs tu bs de mr nntp inews entry organization technical university braunschweig germany references bskendigc kd z cdc netcom com p v ainn e matt ksu ksu edu ba da i dbstu rz tu bs de apr batman bmd trw com date mon apr gmt lines in article apr batman bmd trw com jbrown batman bmd trw com writes did not you say lucifer was created with a perfect nature yes define perfect then i think you are playing the usual game here make sweeping statements like omni holy or perfect and do not note that they mean exactly what they say and that says that you must not use this terms when it leads to contradictions i m not trying to pl

In [31]:
type(data['text'])

list

## Stopwords, stemming, and tokenizing

In [32]:
#!conda install -c conda-forge spacy
#!python -m spacy download en_core_web_sm
#!pip install -U spacy
import spacy
nlp = spacy.load("en_core_web_sm")

# import en_core_web_sm
#nlp = en_core_web_sm.load()

In [33]:
data['text'][1]

'xref cantaloupe srv cs cmu edu alt atheism alt atheism moderated news answers alt answers path cantaloupe srv cs cmu edu crabapple srv cs cmu edu fs ece cmu edu europa eng gtefsd com howland reston ans net agate netsys ibmpcug mantis mathew from mathew mathew mantis co uk newsgroups alt atheism alt atheism moderated news answers alt answers subject alt atheism faq introduction to atheism summary please read this file before posting to alt atheism keywords faq atheism message id mantis co uk date mon apr gmt expires thu may gmt followup to alt atheism distribution world organization mantis consultants cambridge uk approved news answers request mit edu supersedes mantis co uk lines archive name atheism introduction alt atheism archive name introduction last modified april version begin pgp signed message an introduction to atheism by mathew mathew mantis co uk this article attempts to provide a general introduction to atheism whilst i have tried to be as neutral as possible regarding co

In [34]:
## Adding Custom stopwords to the spacy stopword list
customize_stop_words = ['xref']

for w in customize_stop_words:
    nlp.vocab[w].is_stop = True

In [35]:
## It might be surprising, but spaCy doesn't contain any function for stemming as it relies on lemmatization only. 
## Therefore, in this section, we will use NLTK for stemming.

## load nltk's SnowballStemmer as variable 'stemmer'
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

In [36]:
# Tokenizer helpers: each takes a spaCy Doc and returns a list of stems, lemmas, or raw tokens (optionally excluding stop words)

def tokenize_and_stem(doc, remove_stopwords = True):
    # `doc` has already been tokenized by spaCy; optionally drop stop words here
    if remove_stopwords:
        tokens = [word.text for word in doc if not word.is_stop]
    else:
        tokens = [word.text for word in doc]

    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)

    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems

def tokenize_and_lemmatize(doc, remove_stopwords = True):
    
    if remove_stopwords:
        tokens = [word for word in doc if not word.is_stop]
    else:
        tokens = [word for word in doc]
        
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token.text):
            filtered_tokens.append(token)
            
    lemma = [t.lemma_ for t in filtered_tokens]
    return lemma


def tokenize_only(doc, remove_stopwords = True):
    
    if remove_stopwords:
        tokens = [word.text for word in doc if not word.is_stop]
    else:
        tokens = [word.text for word in doc]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens
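
The three functions above repeat the same stop-word and letter filtering. A minimal sketch of how that shared step could be factored out is shown below. The names `filter_tokens`, `tokenize_only_v2`, and `tokenize_and_lemmatize_v2` are hypothetical, and `FakeToken` is a stand-in that only mimics the `text`, `lemma_`, and `is_stop` attributes of a spaCy token so the example runs without a model.

```python
import re
from dataclasses import dataclass

@dataclass
class FakeToken:
    """Hypothetical stand-in exposing the spaCy Token attributes used above."""
    text: str
    lemma_: str
    is_stop: bool = False

HAS_LETTER = re.compile(r"[a-zA-Z]")

def filter_tokens(doc, remove_stopwords=True):
    """Keep tokens that contain a letter, optionally dropping stop words."""
    return [
        tok for tok in doc
        if HAS_LETTER.search(tok.text) and not (remove_stopwords and tok.is_stop)
    ]

def tokenize_only_v2(doc, remove_stopwords=True):
    return [tok.text for tok in filter_tokens(doc, remove_stopwords)]

def tokenize_and_lemmatize_v2(doc, remove_stopwords=True):
    return [tok.lemma_ for tok in filter_tokens(doc, remove_stopwords)]

demo_doc = [
    FakeToken("the", "the", is_stop=True),
    FakeToken("cats", "cat"),
    FakeToken("42", "42"),
]
print(tokenize_only_v2(demo_doc), tokenize_and_lemmatize_v2(demo_doc))
```

A stemming variant would follow the same pattern, applying `stemmer.stem` to `tok.text` for each filtered token.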

In [37]:
%%time
## Build four separate lists: stemmed text and lemmatized text,
## each with and without stop words.

## Naming Conventions followed ####

## data keys prefixed with 'clean' hold lists which do not contain stopwords

## vocabulary lists prefixed with 'all' contain stopwords.

## use extend so it's a big flat list of vocab

data['clean_text_stemmed'] = []
data['clean_text_lemmatized'] = []
data['text_stemmed'] = []
data['text_lemmatized'] = []

vocab_stemmed = []
allvocab_stemmed =[]

vocab_tokenized = []
allvocab_tokenized = []

vocab_lemmatized = []
allvocab_lemmatized = []


for idx,text in enumerate(data['text']):

    doc = nlp(text)
    print(f"processing {idx} document")
    words_stemmed = tokenize_and_stem(doc)
    vocab_stemmed.extend(words_stemmed)
    data['clean_text_stemmed'].append(words_stemmed)
        
    words_lemmatized = tokenize_and_lemmatize(doc)
    vocab_lemmatized.extend(words_lemmatized)
    data['clean_text_lemmatized'].append(words_lemmatized)
    
       
    allwords_stemmed = tokenize_and_stem(doc, False)
    allvocab_stemmed.extend(allwords_stemmed)
    data['text_stemmed'].append(allwords_stemmed)
    
    allwords_lemmatized = tokenize_and_lemmatize(doc, False)
    allvocab_lemmatized.extend(allwords_lemmatized)
    data['text_lemmatized'].append(allwords_lemmatized)
    
    allwords_tokenized = tokenize_only(doc,False)
    allvocab_tokenized.extend(allwords_tokenized)
    
    words_tokenized = tokenize_only(doc)
    vocab_tokenized.extend(words_tokenized)

processing 0 document
processing 1 document
processing 2 document
processing 3 document
processing 4 document
...
processing 3668 document
processing 3669 document
processing 3670 document
processing 3671 document
processing 3672 document
processing 3673 document
processing 3674 document
processing 3675 document
processing 3676 document
processing 3677 document
processing 3678 document
processing 3679 document
processing 3680 document
processing 3681 document
processing 3682 document
processing 3683 document
processing 3684 document
processing 3685 document
processing 3686 document
processing 3687 document
processing 3688 document
processing 3689 document
processing 3690 document
processing 3691 document
processing 3692 document
processing 3693 document
processing 3694 document
processing 3695 document
processing 3696 document
processing 3697 document
processing 3698 document
processing 3699 document
processing 3700 document
processing 3701 document
processing 3702 document
processing 3703 document
processing 3704 document


processing 3995 document
processing 3996 document
processing 3997 document
processing 3998 document
processing 3999 document
processing 4000 document
processing 4001 document
processing 4002 document
processing 4003 document
processing 4004 document
processing 4005 document
processing 4006 document
processing 4007 document
processing 4008 document
processing 4009 document
processing 4010 document
processing 4011 document
processing 4012 document
processing 4013 document
processing 4014 document
processing 4015 document
processing 4016 document
processing 4017 document
processing 4018 document
processing 4019 document
processing 4020 document
processing 4021 document
processing 4022 document
processing 4023 document
processing 4024 document
processing 4025 document
processing 4026 document
processing 4027 document
processing 4028 document
processing 4029 document
processing 4030 document
processing 4031 document
processing 4032 document
processing 4033 document
processing 4034 document


processing 4325 document
processing 4326 document
processing 4327 document
processing 4328 document
processing 4329 document
processing 4330 document
processing 4331 document
processing 4332 document
processing 4333 document
processing 4334 document
processing 4335 document
processing 4336 document
processing 4337 document
processing 4338 document
processing 4339 document
processing 4340 document
processing 4341 document
processing 4342 document
processing 4343 document
processing 4344 document
processing 4345 document
processing 4346 document
processing 4347 document
processing 4348 document
processing 4349 document
processing 4350 document
processing 4351 document
processing 4352 document
processing 4353 document
processing 4354 document
processing 4355 document
processing 4356 document
processing 4357 document
processing 4358 document
processing 4359 document
processing 4360 document
processing 4361 document
processing 4362 document
processing 4363 document
processing 4364 document


processing 4656 document
processing 4657 document
processing 4658 document
processing 4659 document
processing 4660 document
processing 4661 document
processing 4662 document
processing 4663 document
processing 4664 document
processing 4665 document
processing 4666 document
processing 4667 document
processing 4668 document
processing 4669 document
processing 4670 document
processing 4671 document
processing 4672 document
processing 4673 document
processing 4674 document
processing 4675 document
processing 4676 document
processing 4677 document
processing 4678 document
processing 4679 document
processing 4680 document
processing 4681 document
processing 4682 document
processing 4683 document
processing 4684 document
processing 4685 document
processing 4686 document
processing 4687 document
processing 4688 document
processing 4689 document
processing 4690 document
processing 4691 document
processing 4692 document
processing 4693 document
processing 4694 document
processing 4695 document


processing 4986 document
processing 4987 document
processing 4988 document
processing 4989 document
processing 4990 document
processing 4991 document
processing 4992 document
processing 4993 document
processing 4994 document
processing 4995 document
processing 4996 document
processing 4997 document
processing 4998 document
processing 4999 document
processing 5000 document
processing 5001 document
processing 5002 document
processing 5003 document
processing 5004 document
processing 5005 document
processing 5006 document
processing 5007 document
processing 5008 document
processing 5009 document
processing 5010 document
processing 5011 document
processing 5012 document
processing 5013 document
processing 5014 document
processing 5015 document
processing 5016 document
processing 5017 document
processing 5018 document
processing 5019 document
processing 5020 document
processing 5021 document
processing 5022 document
processing 5023 document
processing 5024 document
processing 5025 document


processing 5314 document
processing 5315 document
processing 5316 document
processing 5317 document
processing 5318 document
processing 5319 document
processing 5320 document
processing 5321 document
processing 5322 document
processing 5323 document
processing 5324 document
processing 5325 document
processing 5326 document
processing 5327 document
processing 5328 document
processing 5329 document
processing 5330 document
processing 5331 document
processing 5332 document
processing 5333 document
processing 5334 document
processing 5335 document
processing 5336 document
processing 5337 document
processing 5338 document
processing 5339 document
processing 5340 document
processing 5341 document
processing 5342 document
processing 5343 document
processing 5344 document
processing 5345 document
processing 5346 document
processing 5347 document
processing 5348 document
processing 5349 document
processing 5350 document
processing 5351 document
processing 5352 document
processing 5353 document


processing 5644 document
processing 5645 document
processing 5646 document
processing 5647 document
processing 5648 document
processing 5649 document
processing 5650 document
processing 5651 document
processing 5652 document
processing 5653 document
processing 5654 document
processing 5655 document
processing 5656 document
processing 5657 document
processing 5658 document
processing 5659 document
processing 5660 document
processing 5661 document
processing 5662 document
processing 5663 document
processing 5664 document
processing 5665 document
processing 5666 document
processing 5667 document
processing 5668 document
processing 5669 document
processing 5670 document
processing 5671 document
processing 5672 document
processing 5673 document
processing 5674 document
processing 5675 document
processing 5676 document
processing 5677 document
processing 5678 document
processing 5679 document
processing 5680 document
processing 5681 document
processing 5682 document
processing 5683 document


processing 5972 document
processing 5973 document
processing 5974 document
processing 5975 document
processing 5976 document
processing 5977 document
processing 5978 document
processing 5979 document
processing 5980 document
processing 5981 document
processing 5982 document
processing 5983 document
processing 5984 document
processing 5985 document
processing 5986 document
processing 5987 document
processing 5988 document
processing 5989 document
processing 5990 document
processing 5991 document
processing 5992 document
processing 5993 document
processing 5994 document
processing 5995 document
processing 5996 document
processing 5997 document
processing 5998 document
processing 5999 document
processing 6000 document
processing 6001 document
processing 6002 document
processing 6003 document
processing 6004 document
processing 6005 document
processing 6006 document
processing 6007 document
processing 6008 document
processing 6009 document
processing 6010 document
processing 6011 document


processing 6302 document
processing 6303 document
processing 6304 document
processing 6305 document
processing 6306 document
processing 6307 document
processing 6308 document
processing 6309 document
processing 6310 document
processing 6311 document
processing 6312 document
processing 6313 document
processing 6314 document
processing 6315 document
processing 6316 document
processing 6317 document
processing 6318 document
processing 6319 document
processing 6320 document
processing 6321 document
processing 6322 document
processing 6323 document
processing 6324 document
processing 6325 document
processing 6326 document
processing 6327 document
processing 6328 document
processing 6329 document
processing 6330 document
processing 6331 document
processing 6332 document
processing 6333 document
processing 6334 document
processing 6335 document
processing 6336 document
processing 6337 document
processing 6338 document
processing 6339 document
processing 6340 document
processing 6341 document


processing 6633 document
processing 6634 document
processing 6635 document
processing 6636 document
processing 6637 document
processing 6638 document
processing 6639 document
processing 6640 document
processing 6641 document
processing 6642 document
processing 6643 document
processing 6644 document
processing 6645 document
processing 6646 document
processing 6647 document
processing 6648 document
processing 6649 document
processing 6650 document
processing 6651 document
processing 6652 document
processing 6653 document
processing 6654 document
processing 6655 document
processing 6656 document
processing 6657 document
processing 6658 document
processing 6659 document
processing 6660 document
processing 6661 document
processing 6662 document
processing 6663 document
processing 6664 document
processing 6665 document
processing 6666 document
processing 6667 document
processing 6668 document
processing 6669 document
processing 6670 document
processing 6671 document
processing 6672 document


processing 6965 document
processing 6966 document
processing 6967 document
processing 6968 document
processing 6969 document
processing 6970 document
processing 6971 document
processing 6972 document
processing 6973 document
processing 6974 document
processing 6975 document
processing 6976 document
processing 6977 document
processing 6978 document
processing 6979 document
processing 6980 document
processing 6981 document
processing 6982 document
processing 6983 document
processing 6984 document
processing 6985 document
processing 6986 document
processing 6987 document
processing 6988 document
processing 6989 document
processing 6990 document
processing 6991 document
processing 6992 document
processing 6993 document
processing 6994 document
processing 6995 document
processing 6996 document
processing 6997 document
processing 6998 document
processing 6999 document
processing 7000 document
processing 7001 document
processing 7002 document
processing 7003 document
processing 7004 document


processing 7295 document
processing 7296 document
processing 7297 document
processing 7298 document
processing 7299 document
processing 7300 document
processing 7301 document
processing 7302 document
processing 7303 document
processing 7304 document
processing 7305 document
processing 7306 document
processing 7307 document
processing 7308 document
processing 7309 document
processing 7310 document
processing 7311 document
processing 7312 document
processing 7313 document
processing 7314 document
processing 7315 document
processing 7316 document
processing 7317 document
processing 7318 document
processing 7319 document
processing 7320 document
processing 7321 document
processing 7322 document
processing 7323 document
processing 7324 document
processing 7325 document
processing 7326 document
processing 7327 document
processing 7328 document
processing 7329 document
processing 7330 document
processing 7331 document
processing 7332 document
processing 7333 document
processing 7334 document


processing 7623 document
processing 7624 document
processing 7625 document
processing 7626 document
processing 7627 document
processing 7628 document
processing 7629 document
processing 7630 document
processing 7631 document
processing 7632 document
processing 7633 document
processing 7634 document
processing 7635 document
processing 7636 document
processing 7637 document
processing 7638 document
processing 7639 document
processing 7640 document
processing 7641 document
processing 7642 document
processing 7643 document
processing 7644 document
processing 7645 document
processing 7646 document
processing 7647 document
processing 7648 document
processing 7649 document
processing 7650 document
processing 7651 document
processing 7652 document
processing 7653 document
processing 7654 document
processing 7655 document
processing 7656 document
processing 7657 document
processing 7658 document
processing 7659 document
processing 7660 document
processing 7661 document
processing 7662 document


processing 7951 document
processing 7952 document
processing 7953 document
processing 7954 document
processing 7955 document
processing 7956 document
processing 7957 document
processing 7958 document
processing 7959 document
processing 7960 document
processing 7961 document
processing 7962 document
processing 7963 document
processing 7964 document
processing 7965 document
processing 7966 document
processing 7967 document
processing 7968 document
processing 7969 document
processing 7970 document
processing 7971 document
processing 7972 document
processing 7973 document
processing 7974 document
processing 7975 document
processing 7976 document
processing 7977 document
processing 7978 document
processing 7979 document
processing 7980 document
processing 7981 document
processing 7982 document
processing 7983 document
processing 7984 document
processing 7985 document
processing 7986 document
processing 7987 document
processing 7988 document
processing 7989 document
processing 7990 document


processing 8279 document
processing 8280 document
processing 8281 document
processing 8282 document
processing 8283 document
processing 8284 document
processing 8285 document
processing 8286 document
processing 8287 document
processing 8288 document
processing 8289 document
processing 8290 document
processing 8291 document
processing 8292 document
processing 8293 document
processing 8294 document
processing 8295 document
processing 8296 document
processing 8297 document
processing 8298 document
processing 8299 document
processing 8300 document
processing 8301 document
processing 8302 document
processing 8303 document
processing 8304 document
processing 8305 document
processing 8306 document
processing 8307 document
processing 8308 document
processing 8309 document
processing 8310 document
processing 8311 document
processing 8312 document
processing 8313 document
processing 8314 document
processing 8315 document
processing 8316 document
processing 8317 document
processing 8318 document


processing 8608 document
processing 8609 document
processing 8610 document
processing 8611 document
processing 8612 document
processing 8613 document
processing 8614 document
processing 8615 document
processing 8616 document
processing 8617 document
processing 8618 document
processing 8619 document
processing 8620 document
processing 8621 document
processing 8622 document
processing 8623 document
processing 8624 document
processing 8625 document
processing 8626 document
processing 8627 document
processing 8628 document
processing 8629 document
processing 8630 document
processing 8631 document
processing 8632 document
processing 8633 document
processing 8634 document
processing 8635 document
processing 8636 document
processing 8637 document
processing 8638 document
processing 8639 document
processing 8640 document
processing 8641 document
processing 8642 document
processing 8643 document
processing 8644 document
processing 8645 document
processing 8646 document
processing 8647 document


processing 8937 document
processing 8938 document
processing 8939 document
processing 8940 document
processing 8941 document
processing 8942 document
processing 8943 document
processing 8944 document
processing 8945 document
processing 8946 document
processing 8947 document
processing 8948 document
processing 8949 document
processing 8950 document
processing 8951 document
processing 8952 document
processing 8953 document
processing 8954 document
processing 8955 document
processing 8956 document
processing 8957 document
processing 8958 document
processing 8959 document
processing 8960 document
processing 8961 document
processing 8962 document
processing 8963 document
processing 8964 document
processing 8965 document
processing 8966 document
processing 8967 document
processing 8968 document
processing 8969 document
processing 8970 document
processing 8971 document
processing 8972 document
processing 8973 document
processing 8974 document
processing 8975 document
processing 8976 document


processing 9266 document
processing 9267 document
processing 9268 document
processing 9269 document
processing 9270 document
processing 9271 document
processing 9272 document
processing 9273 document
processing 9274 document
processing 9275 document
processing 9276 document
processing 9277 document
processing 9278 document
processing 9279 document
processing 9280 document
processing 9281 document
processing 9282 document
processing 9283 document
processing 9284 document
processing 9285 document
processing 9286 document
processing 9287 document
processing 9288 document
processing 9289 document
processing 9290 document
processing 9291 document
processing 9292 document
processing 9293 document
processing 9294 document
processing 9295 document
processing 9296 document
processing 9297 document
processing 9298 document
processing 9299 document
processing 9300 document
processing 9301 document
processing 9302 document
processing 9303 document
processing 9304 document
processing 9305 document


processing 9594 document
processing 9595 document
processing 9596 document
processing 9597 document
processing 9598 document
processing 9599 document
processing 9600 document
processing 9601 document
processing 9602 document
processing 9603 document
processing 9604 document
processing 9605 document
processing 9606 document
processing 9607 document
processing 9608 document
processing 9609 document
processing 9610 document
processing 9611 document
processing 9612 document
processing 9613 document
processing 9614 document
processing 9615 document
processing 9616 document
processing 9617 document
processing 9618 document
processing 9619 document
processing 9620 document
processing 9621 document
processing 9622 document
processing 9623 document
processing 9624 document
processing 9625 document
processing 9626 document
processing 9627 document
processing 9628 document
processing 9629 document
processing 9630 document
processing 9631 document
processing 9632 document
processing 9633 document


processing 9924 document
processing 9925 document
processing 9926 document
processing 9927 document
processing 9928 document
processing 9929 document
processing 9930 document
processing 9931 document
processing 9932 document
processing 9933 document
processing 9934 document
processing 9935 document
processing 9936 document
processing 9937 document
processing 9938 document
processing 9939 document
processing 9940 document
processing 9941 document
processing 9942 document
processing 9943 document
processing 9944 document
processing 9945 document
processing 9946 document
processing 9947 document
processing 9948 document
processing 9949 document
processing 9950 document
processing 9951 document
processing 9952 document
processing 9953 document
processing 9954 document
processing 9955 document
processing 9956 document
processing 9957 document
processing 9958 document
processing 9959 document
processing 9960 document
processing 9961 document
processing 9962 document
processing 9963 document


processing 10245 document
processing 10246 document
processing 10247 document
processing 10248 document
processing 10249 document
processing 10250 document
processing 10251 document
processing 10252 document
processing 10253 document
processing 10254 document
processing 10255 document
processing 10256 document
processing 10257 document
processing 10258 document
processing 10259 document
processing 10260 document
processing 10261 document
processing 10262 document
processing 10263 document
processing 10264 document
processing 10265 document
processing 10266 document
processing 10267 document
processing 10268 document
processing 10269 document
processing 10270 document
processing 10271 document
processing 10272 document
processing 10273 document
processing 10274 document
processing 10275 document
processing 10276 document
processing 10277 document
processing 10278 document
processing 10279 document
processing 10280 document
processing 10281 document
processing 10282 document

processing 10562 document
processing 10563 document
processing 10564 document
processing 10565 document
processing 10566 document
processing 10567 document
processing 10568 document
processing 10569 document
processing 10570 document
processing 10571 document
processing 10572 document
processing 10573 document
processing 10574 document
processing 10575 document
processing 10576 document
processing 10577 document
processing 10578 document
processing 10579 document
processing 10580 document
processing 10581 document
processing 10582 document
processing 10583 document
processing 10584 document
processing 10585 document
processing 10586 document
processing 10587 document
processing 10588 document
processing 10589 document
processing 10590 document
processing 10591 document
processing 10592 document
processing 10593 document
processing 10594 document
processing 10595 document
processing 10596 document
processing 10597 document
processing 10598 document
processing 10599 document

processing 10879 document
processing 10880 document
processing 10881 document
processing 10882 document
processing 10883 document
processing 10884 document
processing 10885 document
processing 10886 document
processing 10887 document
processing 10888 document
processing 10889 document
processing 10890 document
processing 10891 document
processing 10892 document
processing 10893 document
processing 10894 document
processing 10895 document
processing 10896 document
processing 10897 document
processing 10898 document
processing 10899 document
processing 10900 document
processing 10901 document
processing 10902 document
processing 10903 document
processing 10904 document
processing 10905 document
processing 10906 document
processing 10907 document
processing 10908 document
processing 10909 document
processing 10910 document
processing 10911 document
processing 10912 document
processing 10913 document
processing 10914 document
processing 10915 document
processing 10916 document

processing 11197 document
processing 11198 document
processing 11199 document
processing 11200 document
processing 11201 document
processing 11202 document
processing 11203 document
processing 11204 document
processing 11205 document
processing 11206 document
processing 11207 document
processing 11208 document
processing 11209 document
processing 11210 document
processing 11211 document
processing 11212 document
processing 11213 document
processing 11214 document
processing 11215 document
processing 11216 document
processing 11217 document
processing 11218 document
processing 11219 document
processing 11220 document
processing 11221 document
processing 11222 document
processing 11223 document
processing 11224 document
processing 11225 document
processing 11226 document
processing 11227 document
processing 11228 document
processing 11229 document
processing 11230 document
processing 11231 document
processing 11232 document
processing 11233 document
processing 11234 document

processing 11515 document
processing 11516 document
processing 11517 document
processing 11518 document
processing 11519 document
processing 11520 document
processing 11521 document
processing 11522 document
processing 11523 document
processing 11524 document
processing 11525 document
processing 11526 document
processing 11527 document
processing 11528 document
processing 11529 document
processing 11530 document
processing 11531 document
processing 11532 document
processing 11533 document
processing 11534 document
processing 11535 document
processing 11536 document
processing 11537 document
processing 11538 document
processing 11539 document
processing 11540 document
processing 11541 document
processing 11542 document
processing 11543 document
processing 11544 document
processing 11545 document
processing 11546 document
processing 11547 document
processing 11548 document
processing 11549 document
processing 11550 document
processing 11551 document
processing 11552 document

processing 11831 document
processing 11832 document
processing 11833 document
processing 11834 document
processing 11835 document
processing 11836 document
processing 11837 document
processing 11838 document
processing 11839 document
processing 11840 document
processing 11841 document
processing 11842 document
processing 11843 document
processing 11844 document
processing 11845 document
processing 11846 document
processing 11847 document
processing 11848 document
processing 11849 document
processing 11850 document
processing 11851 document
processing 11852 document
processing 11853 document
processing 11854 document
processing 11855 document
processing 11856 document
processing 11857 document
processing 11858 document
processing 11859 document
processing 11860 document
processing 11861 document
processing 11862 document
processing 11863 document
processing 11864 document
processing 11865 document
processing 11866 document
processing 11867 document
processing 11868 document

processing 12149 document
processing 12150 document
processing 12151 document
processing 12152 document
processing 12153 document
processing 12154 document
processing 12155 document
processing 12156 document
processing 12157 document
processing 12158 document
processing 12159 document
processing 12160 document
processing 12161 document
processing 12162 document
processing 12163 document
processing 12164 document
processing 12165 document
processing 12166 document
processing 12167 document
processing 12168 document
processing 12169 document
processing 12170 document
processing 12171 document
processing 12172 document
processing 12173 document
processing 12174 document
processing 12175 document
processing 12176 document
processing 12177 document
processing 12178 document
processing 12179 document
processing 12180 document
processing 12181 document
processing 12182 document
processing 12183 document
processing 12184 document
processing 12185 document
processing 12186 document

processing 12467 document
processing 12468 document
processing 12469 document
processing 12470 document
processing 12471 document
processing 12472 document
processing 12473 document
processing 12474 document
processing 12475 document
processing 12476 document
processing 12477 document
processing 12478 document
processing 12479 document
processing 12480 document
processing 12481 document
processing 12482 document
processing 12483 document
processing 12484 document
processing 12485 document
processing 12486 document
processing 12487 document
processing 12488 document
processing 12489 document
processing 12490 document
processing 12491 document
processing 12492 document
processing 12493 document
processing 12494 document
processing 12495 document
processing 12496 document
processing 12497 document
processing 12498 document
processing 12499 document
processing 12500 document
processing 12501 document
processing 12502 document
processing 12503 document
processing 12504 document

processing 12784 document
processing 12785 document
processing 12786 document
processing 12787 document
processing 12788 document
processing 12789 document
processing 12790 document
processing 12791 document
processing 12792 document
processing 12793 document
processing 12794 document
processing 12795 document
processing 12796 document
processing 12797 document
processing 12798 document
processing 12799 document
processing 12800 document
processing 12801 document
processing 12802 document
processing 12803 document
processing 12804 document
processing 12805 document
processing 12806 document
processing 12807 document
processing 12808 document
processing 12809 document
processing 12810 document
processing 12811 document
processing 12812 document
processing 12813 document
processing 12814 document
processing 12815 document
processing 12816 document
processing 12817 document
processing 12818 document
processing 12819 document
processing 12820 document
processing 12821 document

processing 13102 document
processing 13103 document
processing 13104 document
processing 13105 document
processing 13106 document
processing 13107 document
processing 13108 document
processing 13109 document
processing 13110 document
processing 13111 document
processing 13112 document
processing 13113 document
processing 13114 document
processing 13115 document
processing 13116 document
processing 13117 document
processing 13118 document
processing 13119 document
processing 13120 document
processing 13121 document
processing 13122 document
processing 13123 document
processing 13124 document
processing 13125 document
processing 13126 document
processing 13127 document
processing 13128 document
processing 13129 document
processing 13130 document
processing 13131 document
processing 13132 document
processing 13133 document
processing 13134 document
processing 13135 document
processing 13136 document
processing 13137 document
processing 13138 document
processing 13139 document

processing 13418 document
processing 13419 document
processing 13420 document
processing 13421 document
processing 13422 document
processing 13423 document
processing 13424 document
processing 13425 document
processing 13426 document
processing 13427 document
processing 13428 document
processing 13429 document
processing 13430 document
processing 13431 document
processing 13432 document
processing 13433 document
processing 13434 document
processing 13435 document
processing 13436 document
processing 13437 document
processing 13438 document
processing 13439 document
processing 13440 document
processing 13441 document
processing 13442 document
processing 13443 document
processing 13444 document
processing 13445 document
processing 13446 document
processing 13447 document
processing 13448 document
processing 13449 document
processing 13450 document
processing 13451 document
processing 13452 document
processing 13453 document
processing 13454 document
processing 13455 document
processing 1

processing 13736 document
processing 13737 document
processing 13738 document
processing 13739 document
processing 13740 document
processing 13741 document
processing 13742 document
processing 13743 document
processing 13744 document
processing 13745 document
processing 13746 document
processing 13747 document
processing 13748 document
processing 13749 document
processing 13750 document
processing 13751 document
processing 13752 document
processing 13753 document
processing 13754 document
processing 13755 document
processing 13756 document
processing 13757 document
processing 13758 document
processing 13759 document
processing 13760 document
processing 13761 document
processing 13762 document
processing 13763 document
processing 13764 document
processing 13765 document
processing 13766 document
processing 13767 document
processing 13768 document
processing 13769 document
processing 13770 document
processing 13771 document
processing 13772 document
processing 13773 document
processing 1

processing 14052 document
processing 14053 document
processing 14054 document
processing 14055 document
processing 14056 document
processing 14057 document
processing 14058 document
processing 14059 document
processing 14060 document
processing 14061 document
processing 14062 document
processing 14063 document
processing 14064 document
processing 14065 document
processing 14066 document
processing 14067 document
processing 14068 document
processing 14069 document
processing 14070 document
processing 14071 document
processing 14072 document
processing 14073 document
processing 14074 document
processing 14075 document
processing 14076 document
processing 14077 document
processing 14078 document
processing 14079 document
processing 14080 document
processing 14081 document
processing 14082 document
processing 14083 document
processing 14084 document
processing 14085 document
processing 14086 document
processing 14087 document
processing 14088 document
processing 14089 document
processing 1

processing 14369 document
processing 14370 document
processing 14371 document
processing 14372 document
processing 14373 document
processing 14374 document
processing 14375 document
processing 14376 document
processing 14377 document
processing 14378 document
processing 14379 document
processing 14380 document
processing 14381 document
processing 14382 document
processing 14383 document
processing 14384 document
processing 14385 document
processing 14386 document
processing 14387 document
processing 14388 document
processing 14389 document
processing 14390 document
processing 14391 document
processing 14392 document
processing 14393 document
processing 14394 document
processing 14395 document
processing 14396 document
processing 14397 document
processing 14398 document
processing 14399 document
processing 14400 document
processing 14401 document
processing 14402 document
processing 14403 document
processing 14404 document
processing 14405 document
processing 14406 document
processing 1

processing 14686 document
processing 14687 document
processing 14688 document
processing 14689 document
processing 14690 document
processing 14691 document
processing 14692 document
processing 14693 document
processing 14694 document
processing 14695 document
processing 14696 document
processing 14697 document
processing 14698 document
processing 14699 document
processing 14700 document
processing 14701 document
processing 14702 document
processing 14703 document
processing 14704 document
processing 14705 document
processing 14706 document
processing 14707 document
processing 14708 document
processing 14709 document
processing 14710 document
processing 14711 document
processing 14712 document
processing 14713 document
processing 14714 document
processing 14715 document
processing 14716 document
processing 14717 document
processing 14718 document
processing 14719 document
processing 14720 document
processing 14721 document
processing 14722 document
processing 14723 document
processing 1

In [38]:
data['text']

['xref cantaloupe srv cs cmu edu alt atheism alt atheism moderated news answers alt answers path cantaloupe srv cs cmu edu crabapple srv cs cmu edu bb andrew cmu edu news sei cmu edu cis ohio state edu magnus acs ohio state edu usenet ins cwru edu agate spool mu edu uunet pipex ibmpcug mantis mathew from mathew mathew mantis co uk newsgroups alt atheism alt atheism moderated news answers alt answers subject alt atheism faq atheist resources summary books addresses music anything related to atheism keywords faq atheism books music fiction addresses contacts message id mantis co uk date mon mar gmt expires thu apr gmt followup to alt atheism distribution world organization mantis consultants cambridge uk approved news answers request mit edu supersedes mantis co uk lines archive name atheism resources alt atheism archive name resources last modified december version atheist resources addresses of atheist organizations usa freedom from religion foundation darwin fish bumper stickers and a

In [39]:
# save the preprocessed data to disk
import pickle
filename = 'data.pkl'
with open(filename, 'wb') as f:
    pickle.dump(data, f)

In [40]:
#import pickle
#data = pickle.load(open('data.pkl', 'rb'))
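
As a quick sanity check on the save and reload steps, the round trip can be sketched on a small stand-in object. The dictionary `toy_data` and the temporary file path are hypothetical and are not the notebook's actual `data`; this only illustrates the pattern.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the preprocessed data dictionary
toy_data = {"text": ["hello world"], "target": [0]}

# Write to a temporary location so the real data.pkl is left alone
path = os.path.join(tempfile.mkdtemp(), "toy_data.pkl")
with open(path, "wb") as fh:
    pickle.dump(toy_data, fh)

# Read the object back and confirm nothing changed
with open(path, "rb") as fh:
    restored = pickle.load(fh)

print(restored == toy_data)
```

Using `with` closes the file handle even if pickling fails partway through.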

In [41]:
pd.DataFrame(data).head()

Unnamed: 0,index,text,target,word_count,clean_text_stemmed,clean_text_lemmatized,text_stemmed,text_lemmatized
0,0,xref cantaloupe srv cs cmu edu alt atheism alt...,0,1772,"[cantaloup, srv, cs, cmu, edu, alt, atheism, a...","[cantaloupe, srv, cs, cmu, edu, alt, atheism, ...","[xref, cantaloup, srv, cs, cmu, edu, alt, athe...","[xref, cantaloupe, srv, cs, cmu, edu, alt, ath..."
1,1,xref cantaloupe srv cs cmu edu alt atheism alt...,0,5425,"[cantaloup, srv, cs, cmu, edu, alt, atheism, a...","[cantaloupe, srv, cs, cmu, edu, alt, atheism, ...","[xref, cantaloup, srv, cs, cmu, edu, alt, athe...","[xref, cantaloupe, srv, cs, cmu, edu, alt, ath..."
2,2,newsgroups alt atheism path cantaloupe srv cs ...,0,806,"[newsgroup, alt, atheism, path, cantaloup, srv...","[newsgroup, alt, atheism, path, cantaloupe, sr...","[newsgroup, alt, atheism, path, cantaloup, srv...","[newsgroup, alt, atheism, path, cantaloupe, sr..."
3,3,xref cantaloupe srv cs cmu edu alt atheism alt...,0,325,"[cantaloup, srv, cs, cmu, edu, alt, atheism, a...","[cantaloupe, srv, cs, cmu, edu, alt, atheism, ...","[xref, cantaloup, srv, cs, cmu, edu, alt, athe...","[xref, cantaloupe, srv, cs, cmu, edu, alt, ath..."
4,4,xref cantaloupe srv cs cmu edu alt atheism soc...,0,206,"[cantaloup, srv, cs, cmu, edu, alt, atheism, s...","[cantaloupe, srv, cs, cmu, edu, alt, atheism, ...","[xref, cantaloup, srv, cs, cmu, edu, alt, athe...","[xref, cantaloupe, srv, cs, cmu, edu, alt, ath..."


In [42]:
data['clean_text_lemmatized']

[['cantaloupe',
  'srv',
  'cs',
  'cmu',
  'edu',
  'alt',
  'atheism',
  'alt',
  'atheism',
  'moderate',
  'news',
  'answer',
  'alt',
  'answer',
  'path',
  'cantaloupe',
  'srv',
  'cs',
  'cmu',
  'edu',
  'crabapple',
  'srv',
  'cs',
  'cmu',
  'edu',
  'bb',
  'andrew',
  'cmu',
  'edu',
  'news',
  'sei',
  'cmu',
  'edu',
  'cis',
  'ohio',
  'state',
  'edu',
  'magnus',
  'acs',
  'ohio',
  'state',
  'edu',
  'usenet',
  'in',
  'cwru',
  'edu',
  'agate',
  'spool',
  'mu',
  'edu',
  'uunet',
  'pipex',
  'ibmpcug',
  'mantis',
  'mathew',
  'mathew',
  'mathew',
  'mantis',
  'co',
  'uk',
  'newsgroup',
  'alt',
  'atheism',
  'alt',
  'atheism',
  'moderate',
  'news',
  'answer',
  'alt',
  'answer',
  'subject',
  'alt',
  'atheism',
  'faq',
  'atheist',
  'resource',
  'summary',
  'book',
  'address',
  'music',
  'relate',
  'atheism',
  'keyword',
  'faq',
  'atheism',
  'book',
  'music',
  'fiction',
  'address',
  'contact',
  'message',
  'd',
  'mantis

In [43]:
print("Data Type: ",type(data['text']))
print("Data Type: ",type(data['clean_text_stemmed']))

print("Length of data: ",len(data['text']))
print("Length of data: ",len(data['clean_text_stemmed']))

Data Type:  <class 'list'>
Data Type:  <class 'list'>
Length of data:  14996
Length of data:  14996


In [44]:
print(data['text'][1])
print("************************************************************")

print("\n clean_text_stemmed \n")
print(data['clean_text_stemmed'][1])

print("************************************************************")
print("\n clean_text_lemmatized \n")
print(data['clean_text_lemmatized'][1])

xref cantaloupe srv cs cmu edu alt atheism alt atheism moderated news answers alt answers path cantaloupe srv cs cmu edu crabapple srv cs cmu edu fs ece cmu edu europa eng gtefsd com howland reston ans net agate netsys ibmpcug mantis mathew from mathew mathew mantis co uk newsgroups alt atheism alt atheism moderated news answers alt answers subject alt atheism faq introduction to atheism summary please read this file before posting to alt atheism keywords faq atheism message id mantis co uk date mon apr gmt expires thu may gmt followup to alt atheism distribution world organization mantis consultants cambridge uk approved news answers request mit edu supersedes mantis co uk lines archive name atheism introduction alt atheism archive name introduction last modified april version begin pgp signed message an introduction to atheism by mathew mathew mantis co uk this article attempts to provide a general introduction to atheism whilst i have tried to be as neutral as possible regarding con

# TF-IDF

In [45]:
## TfidfVectorizer expects each document as a single string, not a list of tokens, so we join the tokens back into one string per document

data['clean_text_stemmed'] = [' '.join(text) for text in data['clean_text_stemmed']]
data['clean_text_lemmatized'] = [' '.join(text) for text in data['clean_text_lemmatized']]
data['clean_text_lemmatized'][0]

'cantaloupe srv cs cmu edu alt atheism alt atheism moderate news answer alt answer path cantaloupe srv cs cmu edu crabapple srv cs cmu edu bb andrew cmu edu news sei cmu edu cis ohio state edu magnus acs ohio state edu usenet in cwru edu agate spool mu edu uunet pipex ibmpcug mantis mathew mathew mathew mantis co uk newsgroup alt atheism alt atheism moderate news answer alt answer subject alt atheism faq atheist resource summary book address music relate atheism keyword faq atheism book music fiction address contact message d mantis co uk date mon mar gmt expire thu apr gmt followup alt atheism distribution world organization mantis consultant cambridge uk approve news answer request mit edu supersede mantis co uk line archive atheism resource alt atheism archive resource modify december version atheist resource address atheist organization usa freedom religion foundation darwin fish bumper sticker assorted atheist paraphernalia available freedom religion foundation write ffrf p o box 

In [46]:
from sklearn.feature_extraction.text import TfidfVectorizer

#define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.95,
                                 min_df=0.05,
                                 use_idf=True, ngram_range=(1,4))  
#got 65% val accuracy with max_df=0.9,min_df=0.2
#got 78% val accuracy with max_df=0.95,min_df=0.1
#got 94% val accuracy with max_df=0.95,min_df=0.05
#tfidf_vectorizer = TfidfVectorizer()    # default settings give about 1 lakh (100,000) features
tfidf_matrix = tfidf_vectorizer.fit_transform(data['clean_text_lemmatized'])

print(tfidf_matrix.shape)

(14996, 823)
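
The effect of `min_df` on the number of features can be seen on a tiny example. This is only a sketch on a made-up four-document corpus (`toy_docs` is hypothetical, not part of the 20 Newsgroups data); it compares a strict document-frequency cutoff with the default.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not taken from the 20 Newsgroups data
toy_docs = [
    "space shuttle launch",
    "space station orbit",
    "hockey game score",
    "hockey team win",
]

# min_df=0.5 keeps only terms that occur in at least half of the documents
strict = TfidfVectorizer(min_df=0.5)
strict.fit(toy_docs)

# min_df=1 (the default) keeps every term that occurs at least once
loose = TfidfVectorizer(min_df=1)
loose.fit(toy_docs)

strict_vocab = sorted(strict.vocabulary_)
loose_vocab = sorted(loose.vocabulary_)
print(strict_vocab)      # terms surviving the strict cutoff
print(len(loose_vocab))  # vocabulary size with the default cutoff
```

A lower `min_df` admits rarer terms, which is consistent with the accuracy comments in the cell above: the more permissive setting gave the classifier more features to work with.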


In [47]:
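# Build a DataFrame whose columns are the feature names (terms), in feature-index order.
# Note: get_feature_names() was removed in scikit-learn 1.2; on newer versions use get_feature_names_out().
tfidf_matrix1=pd.DataFrame(tfidf_matrix.toarray(), columns= tfidf_vectorizer.get_feature_names())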

In [48]:
tfidf_matrix1.head()

Unnamed: 0,able,ac,accept,access,account,act,action,actually,add,address,...,write article,write article apr,wrong,wupost,year,yes,zaphod,zaphod mp,zaphod mp ohio,zaphod mp ohio state
0,0.0,0.0,0.0,0.0,0.02742,0.0,0.0,0.0,0.0,0.132597,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.00826,0.0,0.01834,0.0,0.0,0.0568,0.028436,0.02303,0.0,0.0,...,0.0,0.0,0.024468,0.0,0.012066,0.008207,0.0,0.0,0.0,0.0
2,0.0,0.0,0.09392,0.0,0.049154,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.098044,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.073107,0.0,0.0,0.0,0.0,0.0,0.0


In [49]:
terms = tfidf_vectorizer.get_feature_names()
print(type(terms))
terms[:5]

<class 'list'>


['able', 'ac', 'accept', 'access', 'account']

In [50]:
#Copying the TF-IDF feature matrix so the original stays unchanged when the target column is added

import copy
train_data = copy.deepcopy(tfidf_matrix1)

In [51]:
train_data['target']=data['target']

In [52]:
train_data.head()

Unnamed: 0,able,ac,accept,access,account,act,action,actually,add,address,...,write article apr,wrong,wupost,year,yes,zaphod,zaphod mp,zaphod mp ohio,zaphod mp ohio state,target
0,0.0,0.0,0.0,0.0,0.02742,0.0,0.0,0.0,0.0,0.132597,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
1,0.00826,0.0,0.01834,0.0,0.0,0.0568,0.028436,0.02303,0.0,0.0,...,0.0,0.024468,0.0,0.012066,0.008207,0.0,0.0,0.0,0.0,0
2,0.0,0.0,0.09392,0.0,0.049154,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.098044,0.0,0.0,0.0,0.0,0.0,0.0,0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.073107,0.0,0.0,0.0,0.0,0.0,0.0,0


# Splitting Data into Train and Validation

In [53]:
train_data_new=train_data.sample(frac=0.8,random_state=200) #random state is a seed value
val_data=train_data.drop(train_data_new.index)

In [54]:
#Performing train val split on the data
X_train, y_train = train_data_new.loc[:,train_data_new.columns!='target'], train_data_new.loc[:,'target']

X_val, y_val = val_data.loc[:,val_data.columns!='target'], val_data.loc[:,'target']

In [55]:
print(X_train.shape)
print(y_train.shape)
print(X_val.shape)
print(y_val.shape)

(11997, 823)
(11997,)
(2999, 823)
(2999,)
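
The split above samples rows at random, so class proportions in the validation set can drift slightly from the full data. One possible alternative, shown here only as a sketch on a hypothetical toy frame (`toy` stands in for `train_data`), is scikit-learn's `train_test_split` with `stratify`, which keeps the class balance the same in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for train_data (features plus 'target')
toy = pd.DataFrame({
    "f1": range(20),
    "f2": range(20, 40),
    "target": [0] * 10 + [1] * 10,
})

X = toy.drop(columns="target")
y = toy["target"]

# 80/20 split; stratify=y preserves the class proportions in both parts
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, random_state=200, stratify=y
)

print(X_tr.shape, X_va.shape)
print(y_va.value_counts().to_dict())
```

This is not the method used in the notebook; the reported scores come from the `sample`/`drop` split.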


# Preparation for Model building



## Classification report

In [56]:
def get_CR_CM(train_actual,train_predicted,test_actual,test_predicted):
    print('''
         ========================================
           CLASSIFICATION REPORT FOR TRAIN DATA
         ========================================
        ''')
    print(classification_report(train_actual, train_predicted, digits=4))

    print('''
             =============================================
               CLASSIFICATION REPORT FOR VALIDATION DATA
             =============================================
            ''')
    print(classification_report(test_actual, test_predicted, digits=4))
    

## Function to calculate accuracy, recall, precision and F1 score

In [57]:
scores = pd.DataFrame(columns=['Model','Train_Accuracy','Train_Recall','Train_Precision','Train_F1_Score','Test_Accuracy','Test_Recall','Test_Precision','Test_F1_Score'])

def get_metrics(train_actual,train_predicted,test_actual,test_predicted,model_description,dataframe):
    get_CR_CM(train_actual,train_predicted,test_actual,test_predicted)
    train_accuracy = accuracy_score(train_actual,train_predicted)
    train_recall   = recall_score(train_actual,train_predicted,average ='weighted')
    train_precision= precision_score(train_actual,train_predicted,average ='weighted')
    train_f1score  = f1_score(train_actual,train_predicted,average ='weighted')
    test_accuracy = accuracy_score(test_actual,test_predicted)
    test_recall   = recall_score(test_actual,test_predicted,average ='weighted')
    test_precision= precision_score(test_actual,test_predicted,average ='weighted')
    test_f1score  = f1_score(test_actual,test_predicted,average ='weighted')
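    # DataFrame.append was removed in pandas 2.0, so build a one-row frame and concatenate it
    new_row = pd.DataFrame([[model_description, train_accuracy,train_recall,train_precision,train_f1score,
                             test_accuracy,test_recall,test_precision,test_f1score]],
                           columns=dataframe.columns)
    dataframe = pd.concat([dataframe, new_row], ignore_index=True)
    return dataframe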
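
One property of `average='weighted'` is worth keeping in mind when reading the score tables: weighted recall is always equal to accuracy, because each class's recall is weighted by its support. A small sketch with made-up labels (`y_true` and `y_pred` are hypothetical) illustrates this.

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels for a three-class problem
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 2, 2]

acc = accuracy_score(y_true, y_pred)
weighted_recall = recall_score(y_true, y_pred, average="weighted")

# Per-class recall weighted by support reduces to (correct predictions / total)
print(acc, weighted_recall)
```

This is why the accuracy and recall columns carry identical values in every row of the results.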

# MODEL BUILDING

## Logistic Regression

In [58]:
from sklearn.linear_model import LogisticRegression
log_mod = LogisticRegression(random_state=123)

In [59]:
log_mod.fit(X_train, y_train)

LogisticRegression(random_state=123)

In [60]:
y_pred_train = log_mod.predict(X_train)
y_pred_val = log_mod.predict(X_val)

##### Evaluating the model performance

In [64]:
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score,recall_score,precision_score,f1_score

In [65]:
scores = get_metrics(y_train,y_pred_train,y_val,y_pred_val,"Logistic",scores)
scores


           CLASSIFICATION REPORT FOR TRAIN DATA
        
              precision    recall  f1-score   support

           0     0.9047    0.8452    0.8739       730
           1     0.9607    0.9402    0.9503       468
           2     0.9705    0.9825    0.9765       570
           3     0.9574    0.9836    0.9704       549
           4     0.9721    0.9868    0.9794       530
           5     0.9562    0.9544    0.9553       526
           6     0.9748    0.9169    0.9450       337
           7     0.9741    0.9878    0.9809       572
           8     0.9984    1.0000    0.9992       610
           9     0.9981    0.9981    0.9981       539
          10     0.9983    0.9983    0.9983       589
          11     0.9957    0.9871    0.9914       697
          12     0.9683    0.9752    0.9717       564
          13     0.9916    0.9866    0.9891       599
          14     0.9936    0.9952    0.9944       627
          15     0.9953    0.9953    0.9953       638
          16     0.9375

Unnamed: 0,Model,Train_Accuracy,Train_Recall,Train_Precision,Train_F1_Score,Test_Accuracy,Test_Recall,Test_Precision,Test_F1_Score
0,Logistic,0.955155,0.955155,0.956565,0.955266,0.937312,0.937312,0.938403,0.937355


# Ridge Regression

In [66]:
from sklearn.linear_model import RidgeClassifierCV

In [67]:
# Ridge classifier with built-in cross-validation over the candidate alphas
ridge_model=RidgeClassifierCV(fit_intercept=True, alphas=[0.0125, 0.025, 0.05,.1, .125, .25, .5, 1., 2., 4.,10,100])

In [68]:
ridge_model.fit(X_train,y_train)

RidgeClassifierCV(alphas=array([1.25e-02, 2.50e-02, 5.00e-02, 1.00e-01, 1.25e-01, 2.50e-01,
       5.00e-01, 1.00e+00, 2.00e+00, 4.00e+00, 1.00e+01, 1.00e+02]))

In [69]:
y_pred_train = ridge_model.predict(X_train)
y_pred_val = ridge_model.predict(X_val)

##### Evaluating the model performance

In [70]:
scores = get_metrics(y_train,y_pred_train,y_val,y_pred_val,"Ridge",scores)
scores


           CLASSIFICATION REPORT FOR TRAIN DATA
        
              precision    recall  f1-score   support

           0     0.9665    0.7904    0.8696       730
           1     0.9626    0.9338    0.9479       468
           2     0.9823    0.9737    0.9780       570
           3     0.9627    0.9872    0.9748       549
           4     0.9563    0.9906    0.9731       530
           5     0.9669    0.9430    0.9548       526
           6     0.9784    0.9407    0.9592       337
           7     0.9809    0.9878    0.9843       572
           8     1.0000    0.9967    0.9984       610
           9     1.0000    0.9981    0.9991       539
          10     0.9983    1.0000    0.9992       589
          11     0.9957    0.9943    0.9950       697
          12     0.9668    0.9805    0.9736       564
          13     0.9818    0.9900    0.9859       599
          14     0.9968    0.9968    0.9968       627
          15     1.0000    1.0000    1.0000       638
          16     0.9375

Unnamed: 0,Model,Train_Accuracy,Train_Recall,Train_Precision,Train_F1_Score,Test_Accuracy,Test_Recall,Test_Precision,Test_F1_Score
0,Logistic,0.955155,0.955155,0.956565,0.955266,0.937312,0.937312,0.938403,0.937355
1,Ridge,0.956656,0.956656,0.961149,0.956864,0.949983,0.949983,0.953609,0.950081
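
`RidgeClassifierCV` picks one regularization strength from the `alphas` list by cross-validation, and the fitted estimator exposes the chosen value as `alpha_`. The following sketch shows how to read it back; it uses synthetic data from `make_classification` (an assumption for illustration, not the TF-IDF features), so the selected value says nothing about the notebook's model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifierCV

# Synthetic three-class data standing in for the TF-IDF features
X_toy, y_toy = make_classification(
    n_samples=200, n_features=20, n_informative=5, n_classes=3, random_state=0
)

# Same candidate grid as in the notebook
alphas = [0.0125, 0.025, 0.05, 0.1, 0.125, 0.25, 0.5, 1.0, 2.0, 4.0, 10, 100]
toy_ridge = RidgeClassifierCV(alphas=alphas).fit(X_toy, y_toy)

# alpha_ holds the regularization strength selected by cross-validation
print(toy_ridge.alpha_)
```

On the real data, `ridge_model.alpha_` would show which of the twelve candidates was selected.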


# LASSO

In [71]:
lasso=LogisticRegression(penalty='l1', solver='liblinear')

In [72]:
lasso.fit(X_train,y_train)

LogisticRegression(penalty='l1', solver='liblinear')

In [73]:
y_pred_train = lasso.predict(X_train)
y_pred_val = lasso.predict(X_val)

##### Evaluating the model performance

In [74]:
scores = get_metrics(y_train,y_pred_train,y_val,y_pred_val,"LASSO",scores)
scores


           CLASSIFICATION REPORT FOR TRAIN DATA
        
              precision    recall  f1-score   support

           0     0.9452    0.8041    0.8690       730
           1     0.9442    0.9402    0.9422       468
           2     0.9724    0.9877    0.9800       570
           3     0.9729    0.9818    0.9773       549
           4     0.9793    0.9830    0.9812       530
           5     0.9706    0.9430    0.9566       526
           6     0.9611    0.9525    0.9568       337
           7     0.9843    0.9895    0.9869       572
           8     1.0000    1.0000    1.0000       610
           9     1.0000    0.9963    0.9981       539
          10     0.9983    1.0000    0.9992       589
          11     0.9957    0.9928    0.9943       697
          12     0.9684    0.9787    0.9735       564
          13     0.9883    0.9883    0.9883       599
          14     0.9921    0.9984    0.9952       627
          15     0.9938    1.0000    0.9969       638
          16     0.9347

Unnamed: 0,Model,Train_Accuracy,Train_Recall,Train_Precision,Train_F1_Score,Test_Accuracy,Test_Recall,Test_Precision,Test_F1_Score
0,Logistic,0.955155,0.955155,0.956565,0.955266,0.937312,0.937312,0.938403,0.937355
1,Ridge,0.956656,0.956656,0.961149,0.956864,0.949983,0.949983,0.953609,0.950081
2,LASSO,0.955656,0.955656,0.959101,0.955841,0.952651,0.952651,0.955385,0.952676
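
The L1 penalty used for the LASSO model tends to drive many coefficients exactly to zero, which an L2 penalty does not. The sketch below compares the two on synthetic data (an assumption for illustration; the value `C=0.1` is also chosen here only to make the effect visible and is not the notebook's setting).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data: 5 informative features and 45 pure-noise features
X_toy, y_toy = make_classification(
    n_samples=300, n_features=50, n_informative=5, n_redundant=0, random_state=0
)

# Same penalty/solver pairing as the notebook's LASSO model, with stronger regularization
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X_toy, y_toy)
l2_model = LogisticRegression(penalty="l2", solver="liblinear", C=0.1).fit(X_toy, y_toy)

# Count coefficients that are exactly zero under each penalty
l1_zeros = int(np.sum(l1_model.coef_ == 0))
l2_zeros = int(np.sum(l2_model.coef_ == 0))
print(l1_zeros, l2_zeros)
```

Applied to the fitted `lasso` model, the same count over `lasso.coef_` would show how many of the 823 TF-IDF features each class actually uses.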


# Naive Bayes

In [75]:
from sklearn import model_selection, naive_bayes, svm

In [76]:
# fit the training dataset on the NB classifier
Naive = naive_bayes.MultinomialNB()
Naive.fit(X_train,y_train)

# predict the labels on train dataset
y_pred_train = Naive.predict(X_train)

# predict the labels on validation dataset
y_pred_val = Naive.predict(X_val)

In [77]:
scores = get_metrics(y_train,y_pred_train,y_val,y_pred_val,"Naive Bayes",scores)
scores


           CLASSIFICATION REPORT FOR TRAIN DATA
        
              precision    recall  f1-score   support

           0     0.7796    0.8479    0.8123       730
           1     0.9295    0.7885    0.8532       468
           2     0.9298    0.9754    0.9521       570
           3     0.8889    0.9909    0.9371       549
           4     0.9210    0.9453    0.9330       530
           5     0.8834    0.8498    0.8663       526
           6     0.9478    0.6469    0.7690       337
           7     0.8797    0.8951    0.8873       572
           8     0.9005    0.9492    0.9242       610
           9     0.9876    0.8868    0.9345       539
          10     0.9065    0.9881    0.9456       589
          11     0.9413    0.9656    0.9533       697
          12     0.8308    0.7837    0.8066       564
          13     0.9410    0.8781    0.9085       599
          14     0.8737    0.9601    0.9149       627
          15     0.9830    0.9984    0.9907       638
          16     0.8393

Unnamed: 0,Model,Train_Accuracy,Train_Recall,Train_Precision,Train_F1_Score,Test_Accuracy,Test_Recall,Test_Precision,Test_F1_Score
0,Logistic,0.955155,0.955155,0.956565,0.955266,0.937312,0.937312,0.938403,0.937355
1,Ridge,0.956656,0.956656,0.961149,0.956864,0.949983,0.949983,0.953609,0.950081
2,LASSO,0.955656,0.955656,0.959101,0.955841,0.952651,0.952651,0.955385,0.952676
3,Naive Bayes,0.885721,0.885721,0.886548,0.884115,0.872291,0.872291,0.874986,0.871139


# SVM

In [78]:
from sklearn.svm import SVC

In [79]:
svc_model=SVC()
#svc_line.set_params(classifier__kernel='linear',classifier__C=1,classifier__random_state=123)

In [80]:
svc_model.fit(X_train,y_train)

SVC()

In [81]:
y_pred_train = svc_model.predict(X_train)
y_pred_val = svc_model.predict(X_val)

In [82]:
scores = get_metrics(y_train,y_pred_train,y_val,y_pred_val,"SVM",scores)
scores


           CLASSIFICATION REPORT FOR TRAIN DATA
        
              precision    recall  f1-score   support

           0     0.9767    0.8027    0.8812       730
           1     0.9847    0.9658    0.9752       468
           2     0.9861    0.9947    0.9904       570
           3     0.9647    0.9945    0.9794       549
           4     0.9943    0.9906    0.9924       530
           5     0.9885    0.9829    0.9857       526
           6     0.9969    0.9674    0.9819       337
           7     0.9913    0.9965    0.9939       572
           8     1.0000    1.0000    1.0000       610
           9     1.0000    0.9981    0.9991       539
          10     0.9983    1.0000    0.9992       589
          11     0.9986    0.9971    0.9978       697
          12     0.9893    0.9876    0.9885       564
          13     0.9966    0.9933    0.9950       599
          14     0.9952    0.9984    0.9968       627
          15     1.0000    1.0000    1.0000       638
          16     0.9427

Unnamed: 0,Model,Train_Accuracy,Train_Recall,Train_Precision,Train_F1_Score,Test_Accuracy,Test_Recall,Test_Precision,Test_F1_Score
0,Logistic,0.955155,0.955155,0.956565,0.955266,0.937312,0.937312,0.938403,0.937355
1,Ridge,0.956656,0.956656,0.961149,0.956864,0.949983,0.949983,0.953609,0.950081
2,LASSO,0.955656,0.955656,0.959101,0.955841,0.952651,0.952651,0.955385,0.952676
3,Naive Bayes,0.885721,0.885721,0.886548,0.884115,0.872291,0.872291,0.874986,0.871139
4,SVM,0.966075,0.966075,0.970738,0.966278,0.941647,0.941647,0.945566,0.941824


## Random Search CV

In [83]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

In [84]:
clf_svc=Pipeline(steps=[('classifier',SVC())])

In [85]:

svc_param_random={'classifier__C':[0.001,0.01,0.1,1,10,100],'classifier__gamma':[0,0.0001,0.01,0.1,1,10,100],
               "classifier__kernel":['linear','rbf','poly']}

In [86]:
svc_random=RandomizedSearchCV(clf_svc,param_distributions=svc_param_random,cv=3)

In [87]:
%%time
svc_random.fit(X_train,y_train)

Wall time: 1h 30min 19s


RandomizedSearchCV(cv=3, estimator=Pipeline(steps=[('classifier', SVC())]),
                   param_distributions={'classifier__C': [0.001, 0.01, 0.1, 1,
                                                          10, 100],
                                        'classifier__gamma': [0, 0.0001, 0.01,
                                                              0.1, 1, 10, 100],
                                        'classifier__kernel': ['linear', 'rbf',
                                                               'poly']})
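
After a randomized search finishes, the winning configuration and its cross-validated score are available as `best_params_` and `best_score_`. The sketch below runs the same pipeline pattern on a small synthetic dataset with a reduced, hypothetical parameter space so that it completes in seconds; `n_iter=5` and `random_state=0` are choices made for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Small synthetic dataset standing in for the TF-IDF features
X_toy, y_toy = make_classification(n_samples=120, n_features=10, random_state=0)

pipe = Pipeline(steps=[("classifier", SVC())])

# Reduced, hypothetical search space (the notebook's space is larger)
param_space = {
    "classifier__C": [0.1, 1, 10],
    "classifier__gamma": [0.01, 0.1, 1],
    "classifier__kernel": ["linear", "rbf"],
}

# n_iter sets how many random combinations are tried; random_state makes the draw repeatable
search = RandomizedSearchCV(
    pipe, param_distributions=param_space, n_iter=5, cv=3, random_state=0
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(round(search.best_score_, 4))
```

Since the search in the notebook sets no `random_state`, a rerun may sample different combinations and land on a different best model.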

In [88]:
# save the model to disk
import pickle
filename = 'svc_random.pkl'
with open(filename, 'wb') as f:
    pickle.dump(svc_random, f)

In [89]:
y_pred_train = svc_random.predict(X_train)
y_pred_val = svc_random.predict(X_val)

In [90]:
scores = get_metrics(y_train,y_pred_train,y_val,y_pred_val,"SVM_grid",scores)
scores


           CLASSIFICATION REPORT FOR TRAIN DATA
        
              precision    recall  f1-score   support

           0     0.9766    0.8000    0.8795       730
           1     0.9826    0.9658    0.9741       468
           2     0.9861    0.9947    0.9904       570
           3     0.9647    0.9945    0.9794       549
           4     0.9943    0.9906    0.9924       530
           5     0.9885    0.9810    0.9847       526
           6     0.9969    0.9674    0.9819       337
           7     0.9913    0.9965    0.9939       572
           8     1.0000    1.0000    1.0000       610
           9     1.0000    0.9981    0.9991       539
          10     0.9983    1.0000    0.9992       589
          11     0.9986    0.9971    0.9978       697
          12     0.9893    0.9876    0.9885       564
          13     0.9966    0.9933    0.9950       599
          14     0.9952    0.9984    0.9968       627
          15     1.0000    1.0000    1.0000       638
          16     0.9414

Unnamed: 0,Model,Train_Accuracy,Train_Recall,Train_Precision,Train_F1_Score,Test_Accuracy,Test_Recall,Test_Precision,Test_F1_Score
0,Logistic,0.955155,0.955155,0.956565,0.955266,0.937312,0.937312,0.938403,0.937355
1,Ridge,0.956656,0.956656,0.961149,0.956864,0.949983,0.949983,0.953609,0.950081
2,LASSO,0.955656,0.955656,0.959101,0.955841,0.952651,0.952651,0.955385,0.952676
3,Naive Bayes,0.885721,0.885721,0.886548,0.884115,0.872291,0.872291,0.874986,0.871139
4,SVM,0.966075,0.966075,0.970738,0.966278,0.941647,0.941647,0.945566,0.941824
5,SVM_grid,0.965741,0.965741,0.970463,0.96595,0.943981,0.943981,0.947886,0.944131
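
To compare the models at a glance, the validation F1 scores from the table above can be ranked. This sketch copies the `Test_F1_Score` values by hand into a Series; with the live `scores` DataFrame, `scores.sort_values('Test_F1_Score', ascending=False)` would do the same.

```python
import pandas as pd

# Validation F1 scores copied from the results table above
val_f1 = pd.Series(
    {
        "Logistic": 0.937355,
        "Ridge": 0.950081,
        "LASSO": 0.952676,
        "Naive Bayes": 0.871139,
        "SVM": 0.941824,
        "SVM_grid": 0.944131,
    },
    name="Test_F1_Score",
)

# Highest validation F1 first
ranking = val_f1.sort_values(ascending=False)
print(ranking)
```

On these numbers the L1-penalized logistic regression (LASSO) ranks first and Naive Bayes last.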
