# Exploratory Data Analysis 

This notebook explores and transforms the data we will be using to train a multi-class classifier.

**Note:** you will not be able to run this notebook yourself unless you download the [raw data](https://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html). This notebook specifically uses the '20_newsgroups' data set. 
Downloading this data set is *not* required for the rest of the notebooks in this series, so feel free to simply read through this notebook without executing the cells. 

This data set consists of approximately 20,000 messages taken from 20 Usenet newsgroups.

In the next few cells we inspect the contents of some of the files.

In [1]:
f = open('./20_newsgroups/alt.atheism/51121', 'r') 
lines = f.read()
print(lines)
f.close()

Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51121 soc.motss:139944 rec.scouting:5318
Newsgroups: alt.atheism,soc.motss,rec.scouting
Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!howland.reston.ans.net!wupost!uunet!newsgate.watson.ibm.com!yktnews.watson.ibm.com!watson!Watson.Ibm.Com!strom
From: strom@Watson.Ibm.Com (Rob Strom)
Subject: Re: [soc.motss, et al.] "Princeton axes matching funds for Boy Scouts"
Sender: @watson.ibm.com
Message-ID: <1993Apr05.180116.43346@watson.ibm.com>
Date: Mon, 05 Apr 93 18:01:16 GMT
Distribution: usa
References: <C47EFs.3q47@austin.ibm.com> <1993Mar22.033150.17345@cbnewsl.cb.att.com> <N4HY.93Apr5120934@harder.ccr-p.ida.org>
Organization: IBM Research
Lines: 15

In article <N4HY.93Apr5120934@harder.ccr-p.ida.org>, n4hy@harder.ccr-p.ida.org (Bob McGwier) writes:

|> [1] HOWEVER, I hate economic terrorism and political correctness
|> worse than I hate this policy.  


|> [2] A more effective approach is to stop

In [2]:
f = open('./20_newsgroups/comp.graphics/37921', 'r') 
lines = f.read()
print(lines)
f.close()

In [3]:
f = open('./20_newsgroups/comp.graphics/37930', 'r') 
lines = f.read()
print(lines)
f.close()

Xref: cantaloupe.srv.cs.cmu.edu comp.graphics:37930 comp.unix.aix:23730
Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!howland.reston.ans.net!noc.near.net!uunet!mcsun!fuug!tahko.lpr.carel.fi!tahko.lpr.carel.fi!not-for-mail
From: ari@tahko.lpr.carel.fi (Ari Suutari)
Newsgroups: comp.graphics,comp.unix.aix
Subject: Any graphics packages available for AIX ?
Date: 6 Apr 1993 10:00:38 +0300
Organization: Carelcomp Oy
Lines: 24
Message-ID: <1pr9qnINNiag@tahko.lpr.carel.fi>
NNTP-Posting-Host: tahko.lpr.carel.fi
Keywords: gks graphics


	Does anybody know if there are any good 2d-graphics packages
	available for IBM RS/6000 & AIX ? I'm looking for something
	like DEC's GKS or Hewlett-Packards Starbase, both of which
	have reasonably good support for different output devices
	like plotters, terminals, X etc.

	I have tried also xgks from X11 distribution and IBM's implementation
	of Phigs. Both of them work but we require more output devices
	than

Looking at the messages above, we see some common structure: 

1. The classification is given in the 'Newsgroups' field, found in the header. (Note that some messages have multiple classifications, meaning they were cross-posted.) 
    - We can also extract the classification from the name of the directory containing the file.
2. The message itself starts after the first blank line (a double line break). 
3. There is additional information in the 'Keywords' and 'Subject' lines. 
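
The structure described above can be sketched as a small parser. This is only an illustration: `parse_message` is a hypothetical helper that is not used elsewhere in this notebook, the sample string is an abbreviated copy of the comp.graphics/37930 message, and header values that continue over several lines are not handled.

```python
def parse_message(raw):
    """Split a raw newsgroup message into a header dict and a body string."""
    # The header block ends at the first blank line.
    header_text, _, body = raw.partition("\n\n")
    headers = {}
    for line in header_text.split("\n"):
        # Header lines look like "Name: value"; anything else is skipped.
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip()] = value.strip()
    return headers, body


# Abbreviated version of the comp.graphics/37930 message shown above.
sample = (
    "Newsgroups: comp.graphics,comp.unix.aix\n"
    "Subject: Any graphics packages available for AIX ?\n"
    "Keywords: gks graphics\n"
    "\n"
    "Does anybody know if there are any good 2d-graphics packages\n"
)

headers, body = parse_message(sample)
groups = headers["Newsgroups"].split(",")  # cross-posted messages list several groups
print(groups)
print(headers["Subject"])
print(headers["Keywords"])
```

Splitting the 'Newsgroups' value on commas makes cross-posting explicit, which is one reason the directory name is the simpler source for a single label.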

We will check a few more files to see if they exhibit the same structure:

In [4]:
f = open('./20_newsgroups/comp.graphics/37916', 'r') 
lines = f.read()
print(lines)
f.close()

Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!gatech!asuvax!cs.utexas.edu!zaphod.mps.ohio-state.edu!saimiri.primate.wisc.edu!usenet.coe.montana.edu!news.u.washington.edu!uw-beaver!cs.ubc.ca!unixg.ubc.ca!kakwa.ucs.ualberta.ca!ersys!joth
From: joth@ersys.edmonton.ab.ca (Joe Tham)
Newsgroups: comp.graphics
Subject: Where can I find SIPP?
Message-ID: <yFXJ2B2w165w@ersys.edmonton.ab.ca>
Date: Mon, 05 Apr 93 14:58:21 MDT
Organization: Edmonton Remote Systems #2, Edmonton, AB, Canada
Lines: 11

        I recently got a file describing a library of rendering routines 
called SIPP (SImple Polygon Processor).  Could anyone tell me where I can 
FTP the source code and which is the newest version around?
        Also, I've never used Renderman so I was wondering if Renderman 
is like SIPP?  ie. a library of rendering routines which one uses to make 
a program that creates the image...

                                        Thanks,  Joe Tham

--
Jo

In [5]:
f = open('./20_newsgroups/misc.forsale/70337', 'r') 
lines = f.read()
print(lines)
f.close()

Path: cantaloupe.srv.cs.cmu.edu!rochester!udel!gatech!howland.reston.ans.net!usc!cs.utexas.edu!qt.cs.utexas.edu!news.Brown.EDU!noc.near.net!bigboote.WPI.EDU!bigwpi.WPI.EDU!kedz
From: kedz@bigwpi.WPI.EDU (John Kedziora)
Newsgroups: misc.forsale
Subject: Motorcycle wanted.
Date: 22 Feb 1993 14:22:51 GMT
Organization: Worcester Polytechnic Institute
Lines: 11
Expires: 5/1/93
Message-ID: <1manjr$ja0@bigboote.WPI.EDU>
NNTP-Posting-Host: bigwpi.wpi.edu

Sender: 
Followup-To:kedz@wpi.wpi.edu 
Distribution: ne
Organization: Worcester Polytechnic Institute
Keywords: 

I am looking for an inexpensive motorcycle, nothing fancy, have to be able to do all maintinence my self. looking in the <$400 range.

if you can help me out, GREAT!, please reply by e-mail.





In [6]:
f = open('./20_newsgroups/misc.forsale/74797', 'r') 
lines = f.read()
print(lines)
f.close()

Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!ogicse!usenet.coe.montana.edu!decwrl!sun-barr!male.EBay.Sun.COM!dswalker!donald
From: donald@dswalker.EBay.Sun.COM (Don Walker)
Newsgroups: misc.forsale
Subject: Items for SALE
Message-ID: <1ps3s4$6g@male.EBay.Sun.COM>
Date: 6 Apr 93 14:25:08 GMT
Article-I.D.: male.1ps3s4$6g
Reply-To: donald@dswalker.EBay.Sun.COM
Distribution: world
Organization: Sun Microsystems, Inc.
Lines: 19
NNTP-Posting-Host: dswalker.ebay.sun.com


                        ITEMS FOR SALE



1. Howard Miller Clock. It chimes like a grandfather clock. $250

2. Painting- A Tiger in the snow. It is a beautiful painting, the tiger
   looks like it can jump off of the canvas and get you. $200

3. Mens Diamond Ring, size 10 - $500
a. 3 rows of diamonds
b. 18k gold

Call or email me.

Donald Walker
hm 408-263-3709
wk 408-276-3618



## Transforming the data

We want to store the relevant information and data in a `pandas` `DataFrame`.

We will do this by first creating a list containing this information, then we will make a DataFrame from the list. 

For each file we 
1. extract the message by discarding everything before the first blank line (double line break),
2. extract the `Subject:` line,
3. extract the classification from the name of the directory containing the file. 

In [7]:
import os 

In [8]:
listOfFiles = list()
for (dirpath, dirnames, filenames) in os.walk('./20_newsgroups'):
    listOfFiles += [os.path.join(dirpath, file) for file in filenames]

In [9]:
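data = []    # rows of [message, subject, category]
passed = []  # [category, filename] for every file that could not be parsed
for file in listOfFiles:
    with open(file, 'r') as f:
        try:
            lines = f.read()
            message = lines.split('\n\n', 1)[1]  # everything after the first blank line
            subject = lines.split('\nSubject: ', 1)[1].split('\n', 1)[0]
            category = file.split('/')[2]  # directory name, e.g. 'comp.graphics'
            data.append([message, subject, category])
        except Exception:
            # unreadable file or missing header: record it and move on
            passed.append(file.split('/')[2:])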

In [10]:
len(data)

19924
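
Any file that raises an exception while being read or parsed ends up in `passed` rather than in `data`, so it can be worth checking how many files were skipped and whether they cluster in particular categories. The sketch below assumes each entry of `passed` has the form `[category, filename]`, as produced by `file.split('/')[2:]`; the entries in `passed_example` are made up for illustration.

```python
from collections import Counter

# Hypothetical stand-in for the notebook's `passed` list; each entry is
# [category, filename]. These particular entries are invented.
passed_example = [
    ["comp.graphics", "00001"],
    ["comp.graphics", "00002"],
    ["misc.forsale", "00003"],
]

# Count skipped files per category to see whether any class is hit harder.
skipped_per_category = Counter(category for category, _ in passed_example)

print(len(passed_example), "files skipped")
print(skipped_per_category.most_common())
```

In the notebook itself, the same two lines could be run on `passed` directly.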

### Making a data frame

In [11]:
import pandas as pd

In [12]:
df = pd.DataFrame(data, columns = ['Message', 'Subject', 'Category'])

In [13]:
df.sample(10)

| | Message | Subject | Category |
|---|---|---|---|
| 447 | In article <1483500349@igc.apc.org>, cpr@igc.a... | Re: Ten questions about Israel | talk.politics.mideast |
| 12595 | In article <1993Apr20.053250.24854@worak.kaist... | Re: WANTED: Multi-page GIF!! | comp.graphics |
| 17806 | I am going to stop reading the homosexuality p... | Re: homosexuality | soc.religion.christian |
| 9822 | garrod@dynamo.ecn.purdue.edu (David Garrod) wr... | Re: WACO burning | talk.politics.misc |
| 1900 | Sayeth sjwyrick@lbl.gov (Steve Wyrick):\n$Anyb... | Re: Who was or what is MIATA, as used in... | rec.autos |
| 5561 | We have heard many bad things about the ATI Ul... | So what is the fastest Windows video card for ... | comp.os.ms-windows.misc |
| 10234 | In article <1993Apr15.164644.7348@hemlock.cray... | Re: MOTORCYCLE DETAILING TIP #18 | rec.motorcycles |
| 5372 | I look at zApp and really liked it. However, I... | Re: GUI Application Frameworks for Windows ?? | comp.os.ms-windows.misc |
| 17075 | In <Apr.10.05.31.12.1993.14351@athos.rutgers.e... | Re: Essene New Testament | soc.religion.christian |
| 2742 | Hi all, \n\nI have a IIsi with a floppy drive... | Replacing internal FDHD w/ floptical? | comp.sys.mac.hardware |


In the analysis and modeling in the next notebooks we want to treat the Message and the Subject in the same way. As such, we save time and computation by combining those data frame columns now:

In [14]:
# a space keeps the last word of the message and the first word of the subject apart
df["Text"] = df["Message"] + " " + df["Subject"]
df = df.drop(columns=["Message", "Subject"])
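
When two string columns are combined with `+`, the strings are joined exactly as they are, so a separator matters for any later tokenization. A minimal sketch on two toy rows (made up for illustration, loosely based on the messages shown earlier) compares joining with and without a space:

```python
import pandas as pd

# Toy rows standing in for the real data frame.
toy = pd.DataFrame({
    "Message": ["please reply by e-mail.", "Thanks,  Joe Tham\n"],
    "Subject": ["Motorcycle wanted.", "Where can I find SIPP?"],
})

fused = toy["Message"] + toy["Subject"]          # no separator
spaced = toy["Message"] + " " + toy["Subject"]   # single space in between

print(repr(fused[0]))
print(repr(spaced[0]))
```

Without a separator, "e-mail." and "Motorcycle" run together into a single token; messages that already end in a newline are unaffected either way.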

In [15]:
df.sample(10)

| | Category | Text |
|---|---|---|
| 9531 | talk.politics.misc | In article <1r1pit$n7k@lll-winken.llnl.gov>, e... |
| 11814 | comp.windows.x | In article <1993Apr7.044749.11770@topgun>, smi... |
| 17989 | misc.forsale | I offer $100, shipment at seller's expense, pa... |
| 15791 | talk.politics.guns | In article <1993Apr20.050550.4660@jupiter.sun.... |
| 2410 | comp.sys.mac.hardware | \nIn article <1993Apr6.134746.11972@daimi.aau.... |
| 2837 | comp.sys.mac.hardware | In article <16BB1A4DF.DJCOHEN@YaleVM.YCC.Yale.... |
| 7250 | sci.crypt | strnlght@netcom.com (David Sternlight) writes:... |
| 14251 | sci.electronics | The title says it all. Contact me via EMAIL i... |
| 15141 | talk.politics.guns | \nIt's worse than you show it.....look for Jan... |
| 9756 | talk.politics.misc | In article <C5sI9G.Hx@dscomsa.desy.de>, hallam... |


### Splitting the data

We split the data, using 70% as a training set, and the remaining 30% as a testing set. We then save these data frames as parquet files.

In [16]:
train = df.sample(frac=0.7, random_state=504)
test = df.drop(train.index)  # everything that isn't in the training set

In [17]:
train.shape

(13947, 2)

In [18]:
test.shape

(5977, 2)
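
The shapes are consistent with the split: 0.7 × 19924 = 13946.8, which rounds to 13947 training rows, and 13947 + 5977 = 19924. The same sample-then-drop pattern can be checked on a toy frame; the sketch below uses made-up data and relies on the index being unique, which holds for the default integer index.

```python
import pandas as pd

# Toy frame with the same columns as the real one; contents are invented.
toy = pd.DataFrame({
    "Category": ["a", "b"] * 5,
    "Text": [f"message {i}" for i in range(10)],
})

train_toy = toy.sample(frac=0.7, random_state=504)
test_toy = toy.drop(train_toy.index)  # rows whose index was not sampled

# The two parts should not share rows and should cover the whole frame.
no_overlap = len(train_toy.index.intersection(test_toy.index)) == 0
complete = len(train_toy) + len(test_toy) == len(toy)

print(train_toy.shape, test_toy.shape)
print(no_overlap, complete)
```

Because `sample` draws uniformly at random, the proportion of each category in the two parts is only approximately equal; comparing `value_counts()` on the `Category` column of both frames is a quick way to confirm this.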

In [19]:
train.to_parquet('data/training.parquet')

In [20]:
test.to_parquet('data/testing.parquet')
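
Both calls assume that a `data` directory already exists next to the notebook; as far as we know, `to_parquet` does not create missing directories. A minimal sketch of one way to guard against that follows. `ensure_parent_dir` is a hypothetical helper, not part of pandas, and a temporary directory is used here so that nothing is written to the working directory.

```python
from pathlib import Path
import tempfile


def ensure_parent_dir(path):
    """Create the parent directory of `path` if it is missing; return a Path."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


with tempfile.TemporaryDirectory() as tmp:
    target = ensure_parent_dir(Path(tmp) / "data" / "training.parquet")
    parent_exists = target.parent.is_dir()

print(parent_exists)
```

In the notebook this could be used as `train.to_parquet(ensure_parent_dir('data/training.parquet'))`.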