# Exploratory Data Analysis 

This notebook explores and transforms the data we will be using to train a multi-class classifier.

**Note:** you will not be able to run this notebook yourself unless you download the [raw data](https://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html). This notebook specifically uses the '20_newsgroups' data set. 
Downloading this data set is *not* required for the rest of the notebooks in this series, so feel free to simply read through this notebook without executing the cells. 

This data set consists of approximately 20,000 messages taken from 20 Usenet newsgroups.

In the next few cells we read in and print out individual files from the data set:

In [1]:
with open('./20_newsgroups/alt.atheism/51121', 'r') as f:  # 'r' = read
    lines = f.read()
print(lines)

In [2]:
with open('./20_newsgroups/comp.graphics/37921', 'r') as f:  # 'r' = read
    lines = f.read()
print(lines)

In [3]:
with open('./20_newsgroups/comp.graphics/37930', 'r') as f:  # 'r' = read
    lines = f.read()
print(lines)

If we look at the data above we see some common themes: 

1. The classification is given in the 'Newsgroups' field in the header. (Note that some messages have multiple classifications, meaning they were cross-posted.)
    - Note: We can also get this information from the file path, since each file sits in a directory named after its newsgroup.
2. The message itself starts after the first blank line (a double line break).
3. There is additional information in the 'Keywords' and the 'Subject' lines. 
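
These observations can be checked programmatically. The sketch below is a minimal illustration and not part of the original notebook: it assumes a message follows the header, blank line, body layout seen above, and it uses Python's standard `email` module on a made-up `sample` string (the address and text are hypothetical).

```python
from email import message_from_string

# Hypothetical message in the same layout as the files printed above.
sample = (
    "From: someone@example.com (Some One)\n"
    "Newsgroups: alt.atheism,talk.religion.misc\n"
    "Subject: Re: an example thread\n"
    "Lines: 3\n"
    "\n"
    "First paragraph of the message.\n"
    "\n"
    "Second paragraph of the message.\n"
)

msg = message_from_string(sample)

# Observation 1: cross-posted messages list several groups, separated by commas.
newsgroups = msg["Newsgroups"].split(",")

# Observation 3: 'Subject' is present here; 'Keywords' is optional.
subject = msg["Subject"]
keywords = msg.get("Keywords")  # None when the header is absent

# Observation 2: the body is everything after the first blank line.
body = msg.get_payload()

print(newsgroups)
print(subject)
print(keywords)
print(body)
```

A cross-posted message shows up as more than one entry in `newsgroups`, which is one reason the directory name is a simpler source for a single class label.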

We will check a few more files to see if they follow the same trends:

In [4]:
with open('./20_newsgroups/comp.graphics/37916', 'r') as f:  # 'r' = read
    lines = f.read()
print(lines)

In [5]:
with open('./20_newsgroups/misc.forsale/70337', 'r') as f:  # 'r' = read
    lines = f.read()
print(lines)

In [6]:
with open('./20_newsgroups/misc.forsale/74797', 'r') as f:  # 'r' = read
    lines = f.read()
print(lines)

Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!ogicse!usenet.coe.montana.edu!decwrl!sun-barr!male.EBay.Sun.COM!dswalker!donald
From: donald@dswalker.EBay.Sun.COM (Don Walker)
Newsgroups: misc.forsale
Subject: Items for SALE
Message-ID: <1ps3s4$6g@male.EBay.Sun.COM>
Date: 6 Apr 93 14:25:08 GMT
Article-I.D.: male.1ps3s4$6g
Reply-To: donald@dswalker.EBay.Sun.COM
Distribution: world
Organization: Sun Microsystems, Inc.
Lines: 19
NNTP-Posting-Host: dswalker.ebay.sun.com


                        ITEMS FOR SALE



1. Howard Miller Clock. It chimes like a grandfather clock. $250

2. Painting- A Tiger in the snow. It is a beautiful painting, the tiger
   looks like it can jump off of the canvas and get you. $200

3. Mens Diamond Ring, size 10 - $500
a. 3 rows of diamonds
b. 18k gold

Call or email me.

Donald Walker
hm 408-263-3709
wk 408-276-3618



## Transforming the data

We want to put the relevant messages and data into a Pandas data frame.

We will do this first by creating a list, then making a data frame from the list. 

For each file we

1. extract the message by discarding everything before the first blank line (double line break),
2. extract the 'Subject:' line,
3. extract the classification from the file path (the name of the newsgroup directory).
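
The three steps above can be sketched as a small function. This is an illustrative variant, not the notebook's own code: the name `extract_fields` and the sample message are hypothetical, the subject is looked up only in the header block, and the category is taken with `os.path` rather than by splitting the path on `'/'`.

```python
import os


def extract_fields(raw, path):
    """Return [message, subject, category] for one raw newsgroup file."""
    # Step 1: the message is everything after the first blank line.
    header, separator, message = raw.partition("\n\n")
    if not separator:
        raise ValueError("no blank line between header and message")

    # Step 2: take the text of the first header line that starts with 'Subject: '.
    subject = None
    for line in header.split("\n"):
        if line.startswith("Subject: "):
            subject = line[len("Subject: "):]
            break
    if subject is None:
        raise ValueError("no 'Subject:' line in the header")

    # Step 3: the category is the name of the directory that holds the file.
    category = os.path.basename(os.path.dirname(path))
    return [message, subject, category]


# Hypothetical message text; the path matches one of the files shown earlier.
raw = "From: x\nSubject: Items for SALE\nLines: 2\n\nITEMS FOR SALE\n\nCall or email me.\n"
fields = extract_fields(raw, "./20_newsgroups/misc.forsale/74797")
print(fields)
```

Raising `ValueError` for a malformed file makes it easy for a calling loop to record which files were skipped and why.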

In [7]:
import os 

In [8]:
listOfFiles = list()
for (dirpath, dirnames, filenames) in os.walk('./20_newsgroups'):
    listOfFiles += [os.path.join(dirpath, file) for file in filenames]

In [9]:
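data = []    # one [message, subject, category] entry per parsed file
passed = []  # files that could not be read or parsed
for file in listOfFiles:
    try:
        # The context manager closes the file even if reading fails.
        with open(file, 'r') as f:
            lines = f.read()
        message = lines.split('\n\n', 1)[1]  # everything after the first blank line
        subject = lines.split('\nSubject: ', 1)[1].split('\n', 1)[0]  # rest of the 'Subject:' line
        category = file.split('/')[2]  # name of the newsgroup directory
        data.append([message, subject, category])
    except Exception:
        # Skip the file, but remember which one it was.
        passed.append(file.split('/')[2:])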
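
The notebook does not say why some files end up in `passed`. One plausible cause, assumed here and not confirmed by the source, is a file containing bytes that cannot be decoded as UTF-8. The sketch below reproduces that situation on a temporary file with made-up content and shows that reading with `encoding='latin-1'` succeeds, because Latin-1 maps every byte to a character.

```python
import os
import tempfile

# Hypothetical file content: 0xE9 is 'é' in Latin-1 but is not valid UTF-8 here.
raw_bytes = b"Subject: caf\xe9 for sale\n\nMessage body.\n"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw_bytes)
    path = tmp.name

# Reading as UTF-8 fails on the stray byte.
try:
    with open(path, "r", encoding="utf-8") as f:
        f.read()
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Reading as Latin-1 never raises a decode error.
with open(path, "r", encoding="latin-1") as f:
    text = f.read()

os.remove(path)

print(utf8_ok)
print(text.split("\n\n", 1)[1])
```

Printing a few entries of `passed` together with the caught exception would show whether this assumption holds for the actual data set.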

In [10]:
len(data)

19924

### Making a data frame

In [12]:
import pandas as pd

In [13]:
df = pd.DataFrame(data, columns = ['Message', 'Subject', 'Category'])

In [14]:
df.sample(10)

| | Message | Subject | Category |
|---|---|---|---|
| 162 | \n Barf (JS) spewed forth:\n\n> I do (did) ... | Re: The U.S. Holocaust Memorial Museum: A Cost... | talk.politics.mideast |
| 1951 | ajc@philabs.philips.com (Alec Cameron) writes:... | Re: Moving sale | rec.autos |
| 4064 | What position does Mike Lansing play? I ca... | Montreal Question....... | rec.sport.baseball |
| 13479 | \nHi all,\n \n I'm looking for some info re... | Otronics Attache luggable info needed | comp.sys.ibm.pc.hardware |
| 1615 | boyle@cactus.org (Craig Boyle) writes:\n>The q... | Re: Too fast | rec.autos |
| 4863 | Overall (career)\n1.\tDon Mattingly\n2.\tDon M... | RE:Re:ALL-TIME BEST PLAYERS | rec.sport.baseball |
| 10375 | In article <arturo.735339956@infmx> arturo@inf... | Re: Good Reasons to Wave at each other | rec.motorcycles |
| 9119 | In article <C513wJ.75y@encore.com> rcollins@ns... | Re: Top Ten Reasons Not to Aid Russians | talk.politics.misc |
| 7721 | In article <1qsvfcINNq9v@dns1.NMSU.Edu> amolit... | Re: What the clipper nay-sayers sound like to me. | sci.crypt |
| 189 | -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+... | PUBLIC HEARINGS on Ballot Access, Vote Fraud a... | talk.politics.mideast |


In [15]:
df["Text"]=df["Message"]+df["Subject"]
df = df.drop(columns = ["Message", "Subject"])
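
Note that `+` on two string columns joins them with nothing in between, so the last word of a message runs directly into the first word of its subject. The sketch below uses a hypothetical one-row frame `toy` to show the effect and a variant that inserts a space. Whether the difference matters depends on the tokenizer used later, which this notebook does not specify.

```python
import pandas as pd

# Hypothetical one-row frame with the same column names as the notebook's df.
toy = pd.DataFrame(
    {"Message": ["Call or email me."], "Subject": ["Items for SALE"]}
)

# Direct concatenation, as in the cell above.
fused = toy["Message"] + toy["Subject"]

# Variant with an explicit separator between the two fields.
spaced = toy["Message"] + " " + toy["Subject"]

print(fused[0])
print(spaced[0])
```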

In [16]:
df.sample(10)

| | Category | Text |
|---|---|---|
| 18431 | misc.forsale | "Bare" = case, a power supply, and a motherboa... |
| 19716 | talk.religion.misc | In article <1qh4m5INN2pu@ctron-news.ctron.com>... |
| 2446 | comp.sys.mac.hardware | EC>It was called the Mac XL when Sculley came ... |
| 18858 | misc.forsale | Misc. Items for sale:\n\n\nMount Plate: Sony ... |
| 19407 | talk.religion.misc | kempmp@phoenix.oulu.fi (Petri Pihko) writes:\n... |
| 19061 | talk.religion.misc | xcpslf@oryx.com (stephen l favor) writes:\n: :... |
| 3558 | alt.atheism | >>>>> On 25 Apr 93 23:26:20 GMT, bobbe@vice (R... |
| 10997 | comp.windows.x | \n\nWill there be no chance to get the Author ... |
| 1630 | rec.autos | daubendr@NeXTwork.Rose-Hulman.Edu (Darren R Da... |
| 5047 | comp.os.ms-windows.misc | \nHello everybody,\n\nI am searching for (busi... |


### Splitting the data

We split the data, using 70% as a training set, and the remaining 30% as a testing set. We then save these data frames as parquet files.

In [17]:
train = df.sample(frac=0.7, random_state=504)
test = df.drop(train.index)  # everything that isn't in the training set

In [18]:
train.shape

(13947, 2)

In [19]:
test.shape

(5977, 2)
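
The two shapes add up to the full data set (13947 + 5977 = 19924). A quick way to confirm that a split built with `sample` and `drop` is disjoint and complete is sketched below on a hypothetical 20-row frame `toy`; the same checks should apply to `df`, `train`, and `test` as long as the index labels of `df` are unique.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's df, with the same two columns.
toy = pd.DataFrame(
    {
        "Category": ["a", "b"] * 10,
        "Text": [f"document {i}" for i in range(20)],
    }
)

train = toy.sample(frac=0.7, random_state=504)
test = toy.drop(train.index)  # everything that isn't in the training set

# No row appears in both sets.
overlap = train.index.intersection(test.index)

# Together the two sets cover every row exactly once.
covered = len(train) + len(test) == len(toy)

print(len(train), len(test))
print(overlap.empty, covered)
```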

In [20]:
train.to_parquet('data/training.parquet')

In [21]:
test.to_parquet('data/testing.parquet')