# Exploratory Data Analysis 

This notebook explores and transforms the multi-class data we will be using in the rest of the notebooks.

**Note:** you will not be able to run this notebook yourself unless you download the [raw data](https://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html). This notebook specifically uses the '20_newsgroups' data set and assumes it is saved in a subdirectory of the working directory. 
Downloading this data set is *not* required for the rest of the notebooks in this series, so feel free to simply read through this notebook without executing the cells. 

The data set consists of 20,000 messages taken from 20 Usenet newsgroups. Each message is stored in its own file.

In the next few cells we inspect the contents of some of the files:

In [1]:
f = open('./20_newsgroups/alt.atheism/51121', 'r') 
lines = f.read()
print(lines)
f.close()

Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51121 soc.motss:139944 rec.scouting:5318
Newsgroups: alt.atheism,soc.motss,rec.scouting
Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!howland.reston.ans.net!wupost!uunet!newsgate.watson.ibm.com!yktnews.watson.ibm.com!watson!Watson.Ibm.Com!strom
From: strom@Watson.Ibm.Com (Rob Strom)
Subject: Re: [soc.motss, et al.] "Princeton axes matching funds for Boy Scouts"
Sender: @watson.ibm.com
Message-ID: <1993Apr05.180116.43346@watson.ibm.com>
Date: Mon, 05 Apr 93 18:01:16 GMT
Distribution: usa
References: <C47EFs.3q47@austin.ibm.com> <1993Mar22.033150.17345@cbnewsl.cb.att.com> <N4HY.93Apr5120934@harder.ccr-p.ida.org>
Organization: IBM Research
Lines: 15

In article <N4HY.93Apr5120934@harder.ccr-p.ida.org>, n4hy@harder.ccr-p.ida.org (Bob McGwier) writes:

|> [1] HOWEVER, I hate economic terrorism and political correctness
|> worse than I hate this policy.  


|> [2] A more effective approach is to stop

In [2]:
f = open('./20_newsgroups/comp.graphics/37921', 'r') 
lines = f.read()
print(lines)
f.close()

In [3]:
f = open('./20_newsgroups/comp.graphics/37930', 'r') 
lines = f.read()
print(lines)
f.close()

Xref: cantaloupe.srv.cs.cmu.edu comp.graphics:37930 comp.unix.aix:23730
Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!howland.reston.ans.net!noc.near.net!uunet!mcsun!fuug!tahko.lpr.carel.fi!tahko.lpr.carel.fi!not-for-mail
From: ari@tahko.lpr.carel.fi (Ari Suutari)
Newsgroups: comp.graphics,comp.unix.aix
Subject: Any graphics packages available for AIX ?
Date: 6 Apr 1993 10:00:38 +0300
Organization: Carelcomp Oy
Lines: 24
Message-ID: <1pr9qnINNiag@tahko.lpr.carel.fi>
NNTP-Posting-Host: tahko.lpr.carel.fi
Keywords: gks graphics


	Does anybody know if there are any good 2d-graphics packages
	available for IBM RS/6000 & AIX ? I'm looking for something
	like DEC's GKS or Hewlett-Packards Starbase, both of which
	have reasonably good support for different output devices
	like plotters, terminals, X etc.

	I have tried also xgks from X11 distribution and IBM's implementation
	of Phigs. Both of them work but we require more output devices
	than

If we look at the data above we see some structure:

1. The file contains some information in headers.
2. The message itself starts after a double line break. 
3. There is information in the 'Keywords' and the 'Subject' lines of the header which may be informative about the classification of the message. 
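
A minimal sketch of how this structure can be exploited, assuming headers are `Name: value` lines and the body follows the first blank line. The sample text and the helper name `split_message` are illustrative and not taken from the data set.

```python
def split_message(raw):
    """Split a raw Usenet message into a header dict and a body string."""
    # Everything before the first blank line is treated as the header block.
    header_block, _, body = raw.partition('\n\n')
    headers = {}
    for line in header_block.split('\n'):
        name, sep, value = line.partition(':')
        if sep:  # skip lines that are not "Name: value" pairs
            headers[name.strip()] = value.strip()
    return headers, body


# Hypothetical message in the same shape as the files printed above.
sample = (
    "From: someone@example.com\n"
    "Subject: Any graphics packages available?\n"
    "Keywords: gks graphics\n"
    "\n"
    "Does anybody know of a good 2d-graphics package?\n"
)

headers, body = split_message(sample)
print(headers['Subject'])
print(headers['Keywords'])
print(body)
```

Some files, such as `misc.forsale/70337`, carry a second header-like block after the first blank line; this sketch would leave that block in the body.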

We will check a few more files to see if they exhibit the same structure:

In [5]:
f = open('./20_newsgroups/comp.graphics/37916', 'r') 
lines = f.read()
print(lines)
f.close()

Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!gatech!asuvax!cs.utexas.edu!zaphod.mps.ohio-state.edu!saimiri.primate.wisc.edu!usenet.coe.montana.edu!news.u.washington.edu!uw-beaver!cs.ubc.ca!unixg.ubc.ca!kakwa.ucs.ualberta.ca!ersys!joth
From: joth@ersys.edmonton.ab.ca (Joe Tham)
Newsgroups: comp.graphics
Subject: Where can I find SIPP?
Message-ID: <yFXJ2B2w165w@ersys.edmonton.ab.ca>
Date: Mon, 05 Apr 93 14:58:21 MDT
Organization: Edmonton Remote Systems #2, Edmonton, AB, Canada
Lines: 11

        I recently got a file describing a library of rendering routines 
called SIPP (SImple Polygon Processor).  Could anyone tell me where I can 
FTP the source code and which is the newest version around?
        Also, I've never used Renderman so I was wondering if Renderman 
is like SIPP?  ie. a library of rendering routines which one uses to make 
a program that creates the image...

                                        Thanks,  Joe Tham

--
Jo

In [6]:
f = open('./20_newsgroups/misc.forsale/70337', 'r') 
lines = f.read()
print(lines)
f.close()

Path: cantaloupe.srv.cs.cmu.edu!rochester!udel!gatech!howland.reston.ans.net!usc!cs.utexas.edu!qt.cs.utexas.edu!news.Brown.EDU!noc.near.net!bigboote.WPI.EDU!bigwpi.WPI.EDU!kedz
From: kedz@bigwpi.WPI.EDU (John Kedziora)
Newsgroups: misc.forsale
Subject: Motorcycle wanted.
Date: 22 Feb 1993 14:22:51 GMT
Organization: Worcester Polytechnic Institute
Lines: 11
Expires: 5/1/93
Message-ID: <1manjr$ja0@bigboote.WPI.EDU>
NNTP-Posting-Host: bigwpi.wpi.edu

Sender: 
Followup-To:kedz@wpi.wpi.edu 
Distribution: ne
Organization: Worcester Polytechnic Institute
Keywords: 

I am looking for an inexpensive motorcycle, nothing fancy, have to be able to do all maintinence my self. looking in the <$400 range.

if you can help me out, GREAT!, please reply by e-mail.





In [7]:
f = open('./20_newsgroups/misc.forsale/74797', 'r') 
lines = f.read()
print(lines)
f.close()

Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!ogicse!usenet.coe.montana.edu!decwrl!sun-barr!male.EBay.Sun.COM!dswalker!donald
From: donald@dswalker.EBay.Sun.COM (Don Walker)
Newsgroups: misc.forsale
Subject: Items for SALE
Message-ID: <1ps3s4$6g@male.EBay.Sun.COM>
Date: 6 Apr 93 14:25:08 GMT
Article-I.D.: male.1ps3s4$6g
Reply-To: donald@dswalker.EBay.Sun.COM
Distribution: world
Organization: Sun Microsystems, Inc.
Lines: 19
NNTP-Posting-Host: dswalker.ebay.sun.com


                        ITEMS FOR SALE



1. Howard Miller Clock. It chimes like a grandfather clock. $250

2. Painting- A Tiger in the snow. It is a beautiful painting, the tiger
   looks like it can jump off of the canvas and get you. $200

3. Mens Diamond Ring, size 10 - $500
a. 3 rows of diamonds
b. 18k gold

Call or email me.

Donald Walker
hm 408-263-3709
wk 408-276-3618



## Transforming the data

We want to store the data in a `pandas` `DataFrame`, with each message making up one row of the DataFrame. 

We will do this by first creating a list containing this information, then we will make a DataFrame from the list. 

For each file we:

1. Extract the message by discarding everything before the first double line break,
2. Extract the `Subject:` line,
3. Extract the classification of the message from the file path (the name of the newsgroup directory). 
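
The three steps can be sketched as a single helper. This is an illustrative version only: the name `extract_record` is hypothetical, and it assumes each path has the form `./20_newsgroups/<category>/<file>`.

```python
import os


def extract_record(raw, path):
    """Return [message, subject, category], or None if the structure is missing."""
    # Step 1: the message is everything after the first double line break.
    _, sep, message = raw.partition('\n\n')
    if not sep:
        return None
    # Step 2: the subject is the rest of the first line starting with 'Subject: '.
    subject = None
    for line in raw.split('\n'):
        if line.startswith('Subject: '):
            subject = line[len('Subject: '):]
            break
    if subject is None:
        return None
    # Step 3: the category is the name of the directory that contains the file.
    category = os.path.basename(os.path.dirname(path))
    return [message, subject, category]


# Hypothetical input shaped like one of the files shown earlier.
raw = "From: a@example.com\nSubject: Where can I find SIPP?\n\nI recently got a file.\n"
record = extract_record(raw, './20_newsgroups/comp.graphics/37916')
print(record)
```

Returning `None` for files that lack the expected structure makes it easy to count and inspect them separately.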

In [8]:
import os 

In [9]:
listOfFiles = list()
for (dirpath, dirnames, filenames) in os.walk('./20_newsgroups'):
    listOfFiles += [os.path.join(dirpath, file) for file in filenames]

In [10]:
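data = []
passed = []
for file in listOfFiles:
    try:
        # The context manager closes the file even if reading fails.
        with open(file, 'r') as f:
            lines = f.read()
        message = lines.split('\n\n', 1)[1]  # text after the first double line break
        subject = lines.split('\nSubject: ', 1)[1].split('\n', 1)[0]
        category = file.split('/')[2]  # newsgroup directory name
        data.append([message, subject, category])
    except Exception:
        # Skip files that cannot be decoded or lack the expected structure.
        passed.append(file.split('/')[2:])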

In [11]:
len(data)

19924
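
Files that raise an exception are skipped and recorded in `passed` as `[category, filename]` pairs. A quick tally by newsgroup shows where the skipped files concentrate. The sketch below uses a small hypothetical `passed` list so that it runs on its own; the entries are invented for illustration.

```python
from collections import Counter

# Hypothetical stand-in for the `passed` list built above; entries are invented.
passed = [
    ['comp.graphics', '00001'],
    ['comp.graphics', '00002'],
    ['sci.space', '00003'],
]

# Count skipped files per newsgroup (the first element of each pair).
skipped_per_category = Counter(category for category, _ in passed)
print(skipped_per_category.most_common())
```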

We can look at a slice of the list like so:

In [17]:
data[0:10]

[['In article <1993Apr14.125813.21737@ncsu.edu> hernlem@chess.ncsu.edu (Brad Hernlem) writes:\n\n   Lebanese resistance forces detonated a bomb under an Israeli occupation\n   patrol in Lebanese territory two days ago. Three soldiers were killed and\n   two wounded. In "retaliation", Israeli and Israeli-backed forces wounded\n   8 civilians by bombarding several Lebanese villages. Ironically, the Israeli\n   government justifies its occupation in Lebanon by claiming that it is \n   necessary to prevent such bombardments of Israeli villages!!\n\n   Congratulations to the brave men of the Lebanese resistance! With every\n   Israeli son that you place in the grave you are underlining the moral\n   bankruptcy of Israel\'s occupation and drawing attention to the Israeli\n   government\'s policy of reckless disregard for civilian life.\n\n   Brad Hernlem (hernlem@chess.ncsu.EDU)\n\nVery nice. Three people are murdered, and Bradly is overjoyed. When I\nhear about deaths in the middle east, be

### Making a DataFrame

We transform the list into a [Pandas](https://pandas.pydata.org) DataFrame, as many machine learning models from Python packages accept Pandas DataFrames as input. 

In [18]:
import pandas as pd

In [19]:
df = pd.DataFrame(data, columns = ['Message', 'Subject', 'Category'])

In [20]:
df.sample(10)

| | Message | Subject | Category |
|---|---|---|---|
| 15656 | >What does this <censored> from NORWAY think h... | Re: Change of name ?? | talk.politics.guns |
| 5690 | rsrodger@wam.umd.edu (Yamanari) writes:\n>\tI'... | Re: Challenge to Microsoft supporters. | comp.os.ms-windows.misc |
| 7149 | Does anyone know of a non-word password genera... | Non-word password generator | sci.crypt |
| 16568 | \n\nIn article <1993Apr28.002214.16544@Princet... | Re: temperature of the dark sky | sci.space |
| 12164 | Hi Steve,\n\nAs the author of Multiverse, I fe... | Re: Virtual Reality for X on the CHEAP! | comp.graphics |
| 16343 | In article <1993Apr18.014305.28536@sfu.ca>, Le... | Re: Orion drive in vacuum -- how? | sci.space |
| 18169 | Greetings netters,\n\tI have the following ite... | CD player, 3.5" 1.44mg floppy, 360K floppy, RC... | misc.forsale |
| 12298 | Hi,\nhas anyone more info about the XGA-2 chip... | XGA-2 info? | comp.graphics |
| 2078 | What is the maximum rate of the 6882 FPU that ... | How fast is M6775 LL/A (Apple FPU)? | comp.sys.mac.hardware |
| 3048 | I am doing research on atheism, part of which ... | Atheism survey | alt.atheism |


For simplicity, in the subsequent notebooks we want to treat the `Message` and the `Subject` as being part of one long piece of text. As such, we save time and computation by combining those DataFrame columns now:

In [21]:
df["Text"] = df["Message"] + " " + df["Subject"]  # the space keeps the last word of the message from fusing with the subject
df = df.drop(columns = ["Message", "Subject"])

In [22]:
df.sample(10)

| | Category | Text |
|---|---|---|
| 19052 | talk.religion.misc | In article <20APR199301460499@utarlg.uta.edu> ... |
| 7252 | sci.crypt | In article <strnlghtC5t4o3.K5p@netcom.com> str... |
| 13176 | comp.sys.ibm.pc.hardware | \nIn article <1993Apr23.142720.25002@spartan.a... |
| 12653 | comp.graphics | In article <1993May1.092058.1@aurora.alaska.ed... |
| 16069 | sci.space | Question: what is the power spectrum of the bu... |
| 19305 | talk.religion.misc | In article <79615@cup.portal.com> Thyagi@cup.p... |
| 10658 | rec.motorcycles | In article <1r941o$3tu@menudo.uh.edu> inde7wv@... |
| 14798 | sci.electronics | In article <C5qsBF.IEK@ms.uky.edu>, billq@ms.u... |
| 1210 | rec.autos | Hi! This is my first time to post on this news... |
| 9218 | talk.politics.misc | \nIn article <SKUKRETI.147.733811021@CHEMICAL.... |


### Splitting the data

We split the data, using a random 70% of the rows as the training set and the remaining 30% as the testing set. We save both DataFrames as parquet files.

In [16]:
train = df.sample(frac=0.7, random_state=504)
test = df.drop(train.index)  # everything that isn't in the training set

In [17]:
train.shape

(13947, 2)

In [18]:
test.shape

(5977, 2)
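
A quick sanity check on a split like the one above is to confirm that the two sets share no rows and together cover the original DataFrame. The sketch below does this on a small made-up DataFrame; the toy data and the names `toy`, `toy_train` and `toy_test` are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the real DataFrame: 10 rows, two columns.
toy = pd.DataFrame({
    'Category': ['a', 'b'] * 5,
    'Text': [f'message {i}' for i in range(10)],
})

toy_train = toy.sample(frac=0.7, random_state=504)
toy_test = toy.drop(toy_train.index)  # everything that isn't in the training set

# No index label appears in both sets, and together they cover every row.
assert toy_train.index.intersection(toy_test.index).empty
assert len(toy_train) + len(toy_test) == len(toy)
print(toy_train.shape, toy_test.shape)
```

The check relies on the index labels being unique, as they are for the default integer index.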

In [19]:
train.to_parquet('data/training.parquet')

In [20]:
test.to_parquet('data/testing.parquet')