# Exploratory Data Analysis 

This notebook explores and transforms the data we will be using to train a multi-class classifier.

**Note:** you will not be able to run this notebook yourself unless you download the [raw data](https://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html). This notebook specifically uses the '20_newsgroups' data set. 
Downloading this data set is *not* required for the rest of the notebooks in this series, so feel free to simply read through this notebook without executing the cells. 

This data set consists of roughly 20,000 messages taken from 20 Usenet newsgroups.

In the next few cells we read in and print out individual files from the data set:

In [1]:
with open('./20_newsgroups/alt.atheism/51121', 'r') as f:  # 'r' = read
    lines = f.read()
print(lines)

Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51121 soc.motss:139944 rec.scouting:5318
Newsgroups: alt.atheism,soc.motss,rec.scouting
Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!howland.reston.ans.net!wupost!uunet!newsgate.watson.ibm.com!yktnews.watson.ibm.com!watson!Watson.Ibm.Com!strom
From: strom@Watson.Ibm.Com (Rob Strom)
Subject: Re: [soc.motss, et al.] "Princeton axes matching funds for Boy Scouts"
Sender: @watson.ibm.com
Message-ID: <1993Apr05.180116.43346@watson.ibm.com>
Date: Mon, 05 Apr 93 18:01:16 GMT
Distribution: usa
References: <C47EFs.3q47@austin.ibm.com> <1993Mar22.033150.17345@cbnewsl.cb.att.com> <N4HY.93Apr5120934@harder.ccr-p.ida.org>
Organization: IBM Research
Lines: 15

In article <N4HY.93Apr5120934@harder.ccr-p.ida.org>, n4hy@harder.ccr-p.ida.org (Bob McGwier) writes:

|> [1] HOWEVER, I hate economic terrorism and political correctness
|> worse than I hate this policy.  


|> [2] A more effective approach is to stop

In [2]:
with open('./20_newsgroups/comp.graphics/37921', 'r') as f:  # 'r' = read
    lines = f.read()
print(lines)

Xref: cantaloupe.srv.cs.cmu.edu alt.3d:2141 comp.graphics:37921
Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!gatech!swrinde!zaphod.mps.ohio-state.edu!usc!elroy.jpl.nasa.gov!ames!olivea!uunet!mcsun!fuug!kiae!relcom!newsserv
From: alex@talus.msk.su (Alex Kolesov)
Newsgroups: alt.3d,comp.graphics
Subject: Help on RenderMan language wanted!
Message-ID: <9304051103.AA01274@talus.msk.su>
Date: 5 Apr 93 11:00:50 GMT
Sender: news-service@newcom.kiae.su
Reply-To: alex@talus.msk.su
Organization: unknown
Lines: 17

Hello everybody !

If you are using PIXAR'S RenderMan 3D scene description language for creating 3D worlds, please, help me. 

I'm using RenderMan library on my NeXT but there is no documentation about NeXTSTEP version of RenderMan available. I can create very complicated scenes and render them using surface shaders, 
but I can not bring them to life by applying shadows and reflections.

As far as I understand I have to define environme

In [3]:
with open('./20_newsgroups/comp.graphics/37930', 'r') as f:  # 'r' = read
    lines = f.read()
print(lines)

Xref: cantaloupe.srv.cs.cmu.edu comp.graphics:37930 comp.unix.aix:23730
Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!howland.reston.ans.net!noc.near.net!uunet!mcsun!fuug!tahko.lpr.carel.fi!tahko.lpr.carel.fi!not-for-mail
From: ari@tahko.lpr.carel.fi (Ari Suutari)
Newsgroups: comp.graphics,comp.unix.aix
Subject: Any graphics packages available for AIX ?
Date: 6 Apr 1993 10:00:38 +0300
Organization: Carelcomp Oy
Lines: 24
Message-ID: <1pr9qnINNiag@tahko.lpr.carel.fi>
NNTP-Posting-Host: tahko.lpr.carel.fi
Keywords: gks graphics


	Does anybody know if there are any good 2d-graphics packages
	available for IBM RS/6000 & AIX ? I'm looking for something
	like DEC's GKS or Hewlett-Packards Starbase, both of which
	have reasonably good support for different output devices
	like plotters, terminals, X etc.

	I have tried also xgks from X11 distribution and IBM's implementation
	of Phigs. Both of them work but we require more output devices
	than

If we look at the data above we see some common themes: 

1. It looks like the classification is given in the 'Newsgroups' field in the header. (Note that some messages have multiple classifications, meaning they were cross-posted.)
    - Note: We can also get this information from the file path, since each message is stored in a directory named after its newsgroup.
2. The message itself starts after a double line break. 
3. There is additional information in the 'Keywords' and the 'Subject' lines. 
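
The observations above can be sketched in code. The following is a minimal illustration, not the approach used later in this notebook: the function name `split_message` and the shortened sample message are hypothetical, and folded (multi-line) header values are not handled.

```python
def split_message(raw):
    """Split a raw Usenet message into a header dict and a body string."""
    # Headers and body are separated by the first blank line.
    header_block, _, body = raw.partition('\n\n')
    headers = {}
    for line in header_block.split('\n'):
        # Each header line looks like 'Name: value'; skip anything else.
        name, sep, value = line.partition(':')
        if sep:
            headers[name.strip()] = value.strip()
    return headers, body


# Hypothetical, shortened message used only for illustration.
sample = (
    "From: someone@example.com (Some One)\n"
    "Newsgroups: comp.graphics,comp.unix.aix\n"
    "Subject: Any graphics packages available for AIX ?\n"
    "Keywords: gks graphics\n"
    "\n"
    "Does anybody know if there are any good 2d-graphics packages?\n"
)

headers, body = split_message(sample)
print(headers['Newsgroups'].split(','))  # one entry per cross-posted group
print(headers['Subject'])
print(body)
```

Splitting the 'Newsgroups' value on commas gives one entry per group the message was cross-posted to.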

We will check a few more files to see if they follow the same trends:

In [4]:
with open('./20_newsgroups/comp.graphics/37916', 'r') as f:  # 'r' = read
    lines = f.read()
print(lines)

Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!gatech!asuvax!cs.utexas.edu!zaphod.mps.ohio-state.edu!saimiri.primate.wisc.edu!usenet.coe.montana.edu!news.u.washington.edu!uw-beaver!cs.ubc.ca!unixg.ubc.ca!kakwa.ucs.ualberta.ca!ersys!joth
From: joth@ersys.edmonton.ab.ca (Joe Tham)
Newsgroups: comp.graphics
Subject: Where can I find SIPP?
Message-ID: <yFXJ2B2w165w@ersys.edmonton.ab.ca>
Date: Mon, 05 Apr 93 14:58:21 MDT
Organization: Edmonton Remote Systems #2, Edmonton, AB, Canada
Lines: 11

        I recently got a file describing a library of rendering routines 
called SIPP (SImple Polygon Processor).  Could anyone tell me where I can 
FTP the source code and which is the newest version around?
        Also, I've never used Renderman so I was wondering if Renderman 
is like SIPP?  ie. a library of rendering routines which one uses to make 
a program that creates the image...

                                        Thanks,  Joe Tham

--
Jo

In [5]:
with open('./20_newsgroups/misc.forsale/70337', 'r') as f:  # 'r' = read
    lines = f.read()
print(lines)

Path: cantaloupe.srv.cs.cmu.edu!rochester!udel!gatech!howland.reston.ans.net!usc!cs.utexas.edu!qt.cs.utexas.edu!news.Brown.EDU!noc.near.net!bigboote.WPI.EDU!bigwpi.WPI.EDU!kedz
From: kedz@bigwpi.WPI.EDU (John Kedziora)
Newsgroups: misc.forsale
Subject: Motorcycle wanted.
Date: 22 Feb 1993 14:22:51 GMT
Organization: Worcester Polytechnic Institute
Lines: 11
Expires: 5/1/93
Message-ID: <1manjr$ja0@bigboote.WPI.EDU>
NNTP-Posting-Host: bigwpi.wpi.edu

Sender: 
Followup-To:kedz@wpi.wpi.edu 
Distribution: ne
Organization: Worcester Polytechnic Institute
Keywords: 

I am looking for an inexpensive motorcycle, nothing fancy, have to be able to do all maintinence my self. looking in the <$400 range.

if you can help me out, GREAT!, please reply by e-mail.





In [6]:
with open('./20_newsgroups/misc.forsale/74797', 'r') as f:  # 'r' = read
    lines = f.read()
print(lines)

Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!ogicse!usenet.coe.montana.edu!decwrl!sun-barr!male.EBay.Sun.COM!dswalker!donald
From: donald@dswalker.EBay.Sun.COM (Don Walker)
Newsgroups: misc.forsale
Subject: Items for SALE
Message-ID: <1ps3s4$6g@male.EBay.Sun.COM>
Date: 6 Apr 93 14:25:08 GMT
Article-I.D.: male.1ps3s4$6g
Reply-To: donald@dswalker.EBay.Sun.COM
Distribution: world
Organization: Sun Microsystems, Inc.
Lines: 19
NNTP-Posting-Host: dswalker.ebay.sun.com


                        ITEMS FOR SALE



1. Howard Miller Clock. It chimes like a grandfather clock. $250

2. Painting- A Tiger in the snow. It is a beautiful painting, the tiger
   looks like it can jump off of the canvas and get you. $200

3. Mens Diamond Ring, size 10 - $500
a. 3 rows of diamonds
b. 18k gold

Call or email me.

Donald Walker
hm 408-263-3709
wk 408-276-3618



## Transforming the data

We want to put the relevant messages and data into a Pandas data frame.

We will do this first by creating a list, then making a data frame from the list. 

For each file we
1. extract the message by discarding everything before the first double line break (the first blank line),
2. extract the 'Subject:' line,
3. extract the classification from the name of the directory that contains the file.

In [7]:
import os 

In [8]:
listOfFiles = list()
for (dirpath, dirnames, filenames) in os.walk('./20_newsgroups'):
    listOfFiles += [os.path.join(dirpath, file) for file in filenames]

In [9]:
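data = []
passed = []  # files that could not be parsed
for file in listOfFiles:
    # The parent directory name is the newsgroup, i.e. the category
    category = os.path.basename(os.path.dirname(file))
    with open(file, 'r') as f:
        try:
            lines = f.read()
            message = lines.split('\n\n', 1)[1]  # everything after the first blank line
            subject = lines.split('\nSubject: ', 1)[1].split('\n', 1)[0]
            data.append([message, subject, category])
        except Exception:
            # e.g. undecodable bytes, or no blank line / 'Subject: ' header found
            passed.append([category, os.path.basename(file)])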

In [10]:
len(data)

19924
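
The count above is lower than the roughly 20000 messages in the data set because any file that raises an error in the loop is recorded in `passed` instead of `data`. A minimal sketch for tallying the skipped files per newsgroup follows. It assumes each entry of `passed` is a `[category, filename]` pair; the entries in `passed_example` are made up for illustration and the helper name `skipped_per_category` is hypothetical.

```python
from collections import Counter

# Hypothetical entries in the same [category, filename] shape that the
# loop above appends to `passed`.
passed_example = [
    ['comp.graphics', '00001'],
    ['comp.graphics', '00002'],
    ['misc.forsale', '00003'],
]


def skipped_per_category(passed):
    """Count how many skipped files fall into each category."""
    return Counter(entry[0] for entry in passed)


counts = skipped_per_category(passed_example)
print(len(passed_example), counts.most_common())
```

Calling `skipped_per_category(passed)` on the real list would show whether the skipped files are spread evenly or concentrated in a few newsgroups.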

In [11]:
### Making a data frame

In [12]:
import pandas as pd

In [13]:
df = pd.DataFrame(data, columns = ['Message', 'Subject', 'Catagory'])

In [14]:
df.sample(10)

| | Message | Subject | Catagory |
|---|---|---|---|
| 12342 | \nIn article <1qvq4b$r4t@wampyr.cc.uow.edu.au>... | Re: Need polygon splitting algo... | comp.graphics |
| 7608 | In article <1r9av2$bg6@transfer.stratus.com>, ... | Re: I have seen the lobby, and it is us | sci.crypt |
| 18725 | \n For Sale: 386SX-16Mz, 8 meg RAM!, 12... | 386sx for sale | misc.forsale |
| 19706 | In article <1r59na$e81@fido.asd.sgi.com> lives... | Re: After 2000 years, can we say that Christia... | talk.religion.misc |
| 5099 | I'm searching for a phonetic TrueType font for... | Searching for a phonetic font | comp.os.ms-windows.misc |
| 15853 | I predict that the outcome of the study of wha... | Re: WACO: Clinton press conference, part 1 | talk.politics.guns |
| 11706 | I am using X11R5patch23 with the R5-SUNOS5 pat... | Problem with libXmu on SUNOS5.1 and gcc | comp.windows.x |
| 7810 | cuffell@spot.Colorado.EDU (Tim Cuffel) writes:... | Re: disk safety measure? | sci.crypt |
| 11873 | Can anyone give me some information, please ..... | Writing a Motif widget | comp.windows.x |
| 2661 | I thought I'd share a good experience, too. I... | Re: Good APS experience | comp.sys.mac.hardware |


### Splitting the data

We split the data, using 70% as a training set, and the remaining 30% as a testing set. We then save these data frames as parquet files.

In [15]:
train = df.sample(frac=0.7, random_state=504)
test = df.drop(train.index)  # everything that isn't in the training set

In [16]:
train.shape

(13947, 3)

In [17]:
test.shape

(5977, 3)
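
The two shapes add up to the full data frame (13947 + 5977 = 19924). A minimal sketch of how such a split can be sanity-checked is shown below on a made-up toy data frame standing in for `df`. It assumes the index labels are unique, which holds for the default integer index.

```python
import pandas as pd

# Toy data frame standing in for `df`; the values are made up.
toy = pd.DataFrame({
    'Message': [f'message {i}' for i in range(20)],
    'Catagory': ['a', 'b'] * 10,
})

train_toy = toy.sample(frac=0.7, random_state=504)
test_toy = toy.drop(train_toy.index)

# No row appears in both sets, and together they cover the whole frame.
assert train_toy.index.intersection(test_toy.index).empty
assert len(train_toy) + len(test_toy) == len(toy)
```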

In [18]:
train.to_parquet('data/training.parquet')

In [19]:
test.to_parquet('data/testing.parquet')
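
The two writes above assume that a `data` directory already exists and that a Parquet engine (`pyarrow` or `fastparquet`) is installed. A minimal sketch for creating the directory beforehand is given below; the helper name `ensure_dir` is hypothetical.

```python
from pathlib import Path


def ensure_dir(path):
    """Create the directory (and any missing parents) if it is not there yet."""
    out_dir = Path(path)
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir


out_dir = ensure_dir('data')
print(out_dir / 'training.parquet')
```

Running this before the `to_parquet` cells avoids a missing-directory error on a fresh checkout.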