## Email Classification Using the 20 Newsgroups Dataset

In [1]:
from sklearn import datasets
traindata= datasets.fetch_20newsgroups(subset='train')
testdata= datasets.fetch_20newsgroups(subset='test')

In [2]:
xtrain = traindata.data
ytrain = traindata.target
print(len(xtrain))
print(len(ytrain))

11314
11314


In [3]:
print(traindata.DESCR)

.. _20newsgroups_dataset:

The 20 newsgroups text dataset
------------------------------

The 20 newsgroups dataset comprises around 18000 newsgroups posts on
20 topics split in two subsets: one for training (or development)
and the other one for testing (or for performance evaluation). The split
between the train and test set is based upon a messages posted before
and after a specific date.

This module contains two loaders. The first one,
:func:`sklearn.datasets.fetch_20newsgroups`,
returns a list of the raw texts that can be fed to text feature
extractors such as :class:`~sklearn.feature_extraction.text.CountVectorizer`
with custom parameters so as to extract feature vectors.
The second one, :func:`sklearn.datasets.fetch_20newsgroups_vectorized`,
returns ready-to-use features, i.e., it is not necessary to use a feature
extractor.

**Data Set Characteristics:**

    Classes                     20
    Samples total            18846
    Dimensionality               1
    Features      

In [8]:
target_names = traindata.target_names
print(target_names)

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


In [14]:
print(xtrain[0])

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----







In [15]:
print(ytrain[0])

7


In [16]:
target_names[ytrain[0]]

'rec.autos'

In [27]:
import re
import spacy

In [30]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("I went to bring tables for my children and his boxes with knives. Mr. john he worked very hard in from the of here.")

In [32]:
for w in doc:
    if w.lemma_!="-PRON-":
        print(w.lemma_)

I
go
to
bring
table
for
my
child
and
his
box
with
knife
.
Mr.
john
he
work
very
hard
in
from
the
of
here
.


## Cleaning

In [34]:
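def regex_cleaning(xdata):
    """Strip email addresses, digit-led tokens and underscores (modifies xdata in place)."""
    for i in range(len(xdata)):
        doc = xdata[i]
        doc = re.sub(r"[a-zA-Z0-9._]+@[a-zA-Z0-9._]+", "", doc)  # remove email addresses
        doc = re.sub(r"\d[0-9a-zA-Z_]+", "", doc)  # drop digit-led runs of two or more characters
        doc = re.sub("_", "", doc)  # remove remaining underscores
        xdata[i] = doc
    return xdata


def lemma_cleaning(xdata):
    """Replace every token with its lemma (modifies xdata in place)."""
    for i in range(len(xdata)):
        doc = xdata[i]
        doc = nlp(doc)
        # spaCy v2 lemmatizes pronouns to '-PRON-'; spaCy v3 keeps pronoun lemmas
        # (see the output above), so this filter only has an effect on v2.
        doc = " ".join([w.lemma_ for w in doc if w.lemma_!='-PRON-'])
        xdata[i] = doc
    return xdata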

In [35]:
def transform(xdata):
    xdata = regex_cleaning(xdata)
    xdata = lemma_cleaning(xdata)
    return xdata

In [36]:
xtrain = transform(xtrain)
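
The regular expressions used in `regex_cleaning` can be tried on a single string before running them over the whole corpus. The sketch below is only an illustration: the `clean_text` helper and the sample header line are made up for this example and are not part of the notebook. It shows that email addresses and multi-character digit-led tokens are removed, while a lone digit survives because the pattern requires at least one character after the first digit.

```python
import re

# Same patterns as regex_cleaning, precompiled as raw strings.
EMAIL_RE = re.compile(r"[a-zA-Z0-9._]+@[a-zA-Z0-9._]+")
DIGIT_TOKEN_RE = re.compile(r"\d[0-9a-zA-Z_]+")


def clean_text(text):
    """Hypothetical single-string version of regex_cleaning."""
    text = EMAIL_RE.sub("", text)        # remove email addresses
    text = DIGIT_TOKEN_RE.sub("", text)  # remove digit-led runs of 2+ characters
    return text.replace("_", "")         # remove remaining underscores


# Made-up sample resembling a newsgroup header line.
sample = "From: lerxst@wam.umd.edu Lines: 15 my_var 2nd 7"
cleaned = clean_text(sample)
print(cleaned.split())
```

Note that the removals leave extra whitespace behind; this is harmless here because the vectorizer tokenizes on word boundaries anyway.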

## Vectorization

In [38]:
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer(lowercase=True,stop_words='english',min_df=5,max_df=0.95)
vec.fit(xtrain)

TfidfVectorizer(max_df=0.95, min_df=5, stop_words='english')

In [39]:
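# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
print(len(vec.get_feature_names_out()))
print(vec.get_feature_names_out())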

17933


In [45]:
# get the vectorized data
x2 = vec.transform(xtrain).toarray()
print(x2.shape)

(11314, 17933)


In [46]:
## Apply ML - Naive Bayes

from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()

In [47]:
# train the model
model.fit(x2,ytrain)

MultinomialNB()

## Performance Analysis

In [48]:
xtest = transform(testdata.data)
ytest = testdata.target

In [49]:
# Vectorize the test documents with the vectorizer fitted on the training data.
# Passing the raw strings to model.predict raises
# "ValueError: Expected 2D array, got 1D array instead".
xtest_vec = vec.transform(xtest)
ypred = model.predict(xtest_vec)

from sklearn import metrics
print("Accuracy ",metrics.accuracy_score(ytest,ypred))

In [None]:
import pandas as pd
pd.DataFrame(metrics.confusion_matrix(ytest,ypred),columns=target_names,index=target_names)
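
The classifier needs the same TF-IDF features at prediction time as at training time. One way to keep the two steps together is to wrap the vectorizer and the classifier in a scikit-learn pipeline, so that raw text goes in at both `fit` and `predict`. The sketch below is a minimal illustration, not the notebook's own code: the toy documents, the labels and the name `text_clf` are made up, and the `min_df`/`max_df` settings are left out because the toy corpus is too small for them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the cleaned newsgroup posts.
train_docs = [
    "the engine and the brakes of this car",
    "car engine specs and model name",
    "the goalie stopped the puck in the hockey game",
    "hockey team won the game last night",
]
train_labels = [0, 0, 1, 1]  # made-up labels: 0 = autos, 1 = hockey

# The pipeline fits the vectorizer and the classifier in one call and
# applies the fitted vectorizer automatically before predicting.
text_clf = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
text_clf.fit(train_docs, train_labels)

test_docs = ["which car has the best engine", "the hockey game was great"]
predictions = text_clf.predict(test_docs)
print(predictions)
```

The pipeline also passes the vectorizer's sparse matrix straight to `MultinomialNB`, which accepts sparse input, so no `.toarray()` call is needed.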