# Exploratory Data Analysis 

This notebook explores and transforms the data we will be using to train a multi-class classifier.

**Note:** you will not be able to run this notebook yourself unless you download the [raw data](https://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html). This notebook specifically uses the '20_newsgroups' data set. 
Downloading this data set is *not* required for the rest of the notebooks in this series, so feel free to simply read through this notebook without executing the cells. 

This data set consists of approximately 20,000 messages taken from 20 Usenet newsgroups.

In the next few cells we inspect the contents of some of the files.

In [1]:
f = open('./20_newsgroups/alt.atheism/51121', 'r') 
lines = f.read()
print(lines)
f.close()

In [2]:
f = open('./20_newsgroups/comp.graphics/37921', 'r') 
lines = f.read()  # read this file; otherwise the previous file's text is printed again
print(lines)
f.close()

In [3]:
f = open('./20_newsgroups/comp.graphics/37930', 'r') 
lines = f.read()
print(lines)
f.close()

If we look at the data above we see some common themes: 

1. The classification is given in the 'Newsgroups' field, found in the header. (Note that some messages have multiple classifications, meaning they were cross-posted). 
    - We can also extract the classification from the file path, since each message is stored in a directory named after its newsgroup.
2. The message itself starts after a double line break. 
3. There is additional information in the 'Keywords' and the 'Subject' lines. 
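
The structure described above can be sketched on a small made-up message. The `split_message` helper and the sample text are hypothetical illustrations, not part of the data set or of this notebook's pipeline, and the sketch assumes that each header sits on a single line of the form `Name: value`.

```python
def split_message(raw):
    """Split a raw Usenet message into a header dict and the body text."""
    # The header block ends at the first blank line (a double line break).
    header_block, _, body = raw.partition('\n\n')
    headers = {}
    for line in header_block.split('\n'):
        name, sep, value = line.partition(': ')
        if sep:  # skip lines that are not of the form "Name: value"
            headers[name] = value
    return headers, body


# Made-up example message, for illustration only.
sample = (
    "Newsgroups: comp.graphics,misc.forsale\n"
    "Subject: Example subject\n"
    "Keywords: example\n"
    "\n"
    "First line of the body.\n"
    "\n"
    "Second paragraph."
)

headers, body = split_message(sample)
# A cross-posted message lists several groups separated by commas.
newsgroups = headers['Newsgroups'].split(',')
print(newsgroups)
print(headers['Subject'])
print(body)
```

Splitting only on the first blank line keeps any later blank lines inside the body, which matters because most messages contain several paragraphs.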

We will check a few more files to see if they exhibit the same structure:

In [4]:
f = open('./20_newsgroups/comp.graphics/37916', 'r') 
lines = f.read()
print(lines)
f.close()

In [5]:
f = open('./20_newsgroups/misc.forsale/70337', 'r') 
lines = f.read()
print(lines)
f.close()

In [6]:
f = open('./20_newsgroups/misc.forsale/74797', 'r') 
lines = f.read()
print(lines)
f.close()

## Transforming the data

We want to store the relevant information and data in a `pandas` `DataFrame`.

We do this by first collecting the information in a list and then building a `DataFrame` from that list. 

For each file we:
1. extract the message by discarding everything before the first double line break,
2. extract the `Subject:` line,
3. extract the classification from the file path (the name of the directory containing the file). 

In [7]:
import os 

In [8]:
listOfFiles = list()
for (dirpath, dirnames, filenames) in os.walk('./20_newsgroups'):
    listOfFiles += [os.path.join(dirpath, file) for file in filenames]

In [9]:
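data = []
passed = []
for file in listOfFiles:
    try:
        with open(file, 'r') as f:
            lines = f.read()
        # The message body is everything after the first blank line.
        message = lines.split('\n\n', 1)[1]
        # The subject is the rest of the first line starting with 'Subject: '.
        subject = lines.split('\nSubject: ', 1)[1].split('\n', 1)[0]
        # The category is the name of the directory containing the file.
        category = file.split('/')[2]
        data.append([message, subject, category])
    except (UnicodeDecodeError, IndexError):
        # Skip files that cannot be decoded or that lack a body or a Subject line.
        passed.append(file.split('/')[2:])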

In [10]:
len(data)

19924
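
Any file that raised an exception in the loop above ends up in `passed` rather than in `data`, so it can be useful to see which newsgroups the skipped files belong to. This is a minimal sketch, assuming each entry of `passed` starts with the category name, as produced by `file.split('/')[2:]`. The values below are made-up stand-ins so that the snippet runs on its own.

```python
from collections import Counter

# Made-up stand-in for the `passed` list built in the loop above:
# each entry is [category, filename].
passed = [
    ['comp.graphics', '00001'],
    ['comp.graphics', '00002'],
    ['sci.med', '00003'],
]

# Count skipped files per category, most affected first.
skipped_per_category = Counter(entry[0] for entry in passed)
for category, count in skipped_per_category.most_common():
    print(category, count)
```

If one category accounts for most of the skipped files, the parsing rules may need adjusting before that class is modeled.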

### Making a data frame

In [11]:
import pandas as pd

In [12]:
df = pd.DataFrame(data, columns = ['Message', 'Subject', 'Category'])

In [13]:
df.sample(10)

| | Message | Subject | Category |
|---|---|---|---|
| 12348 | \n I was skimming through a few gophers a... | .GIFs on a Tek401x ?? | comp.graphics |
| 8366 | \nIn article <1993Apr30.202808.19204@ux1.cts.e... | Re: Antihistamine for sleep aid | sci.med |
| 17789 | [This is a response to a request for a Biblica... | Re: Satan kicked out of heaven: Biblical? | soc.religion.christian |
| 3372 | In <sfnNTrC00WBO43LRUK@andrew.cmu.edu> "David ... | Re: After 2000 years, can we say that Christia... | alt.atheism |
| 3496 | In article <66019@mimsy.umd.edu>\nmangoe@cs.um... | Ontology (was: Benediktine Metaphysics) | alt.atheism |
| 9217 | In article <1993Apr21.230622.6138@gn.ecn.purdu... | Re: Who's next? Mormons and Jews? | talk.politics.misc |
| 10412 | egreen@east.sun.com (Ed Green - Pixel Cruncher... | Re: Countersteering_FAQ please post | rec.motorcycles |
| 9866 | V2110A@VM.TEMPLE.EDU (Richard Hoenes) writes:\... | Re: Waco Investigation Paranoia | talk.politics.misc |
| 1891 | I have manual transmission 5 speed. It difficu... | Manual Xmission-Advice needed... | rec.autos |
| 9884 | In article <C5sCqI.4By@apollo.hp.com> goykhman... | Re: A Message for you Mr. President: How do yo... | talk.politics.misc |


In the analysis and modeling in the next notebooks we want to treat the Message and the Subject in the same way. As such, we save time and computation by combining those data frame columns now:

In [14]:
# Join with a space so the last word of the message does not run into the subject.
df["Text"] = df["Message"] + " " + df["Subject"]
df = df.drop(columns = ["Message", "Subject"])

In [15]:
df.sample(10)

| | Category | Text |
|---|---|---|
| 10386 | rec.motorcycles | In regards ot some of the posts concerning bia... |
| 6290 | rec.sport.hockey | In article <AfnKOVK00UhB01RDtJ@andrew.cmu.edu>... |
| 13116 | comp.sys.ibm.pc.hardware | Hello,\n\n I have a Diamond Stealth VRAM car... |
| 15577 | talk.politics.guns | In article <C6548v.JHA@noose.ecn.purdue.edu> g... |
| 10706 | rec.motorcycles | \nIn article <1993Apr19.154020.24818@i88.isc.c... |
| 17693 | soc.religion.christian | In <Apr.23.02.55.31.1993.3123@geneva.rutgers.e... |
| 1507 | rec.autos | In article <1993Apr14.143750.120204@marshall.w... |
| 4696 | rec.sport.baseball | paula@koufax.cv.hp.com (Paul Andresen) writes:... |
| 11166 | comp.windows.x | Hi\n\nCan someone please give me some pointers... |
| 17301 | soc.religion.christian | In a previous article, 18669@bach.udel.edu (St... |


### Splitting the data

We split the data, using 70% as a training set, and the remaining 30% as a testing set. We then save these data frames as parquet files.

In [16]:
train = df.sample(frac=0.7, random_state=504)
test = df.drop(train.index)  # everything that isn't in the training set

In [17]:
train.shape

(13947, 2)

In [18]:
test.shape

(5977, 2)
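
A quick sanity check on a split made this way is that the two parts share no rows and together cover the whole frame. The sketch below uses a small made-up frame in place of `df`; the names `toy`, `toy_train` and `toy_test` and the column values are hypothetical.

```python
import pandas as pd

# Small made-up frame standing in for `df`.
toy = pd.DataFrame({
    'Category': ['a', 'b'] * 5,
    'Text': [f'message {i}' for i in range(10)],
})

toy_train = toy.sample(frac=0.7, random_state=504)
toy_test = toy.drop(toy_train.index)  # everything not in the training set

# The two parts should be disjoint and together cover every row.
assert toy_train.index.intersection(toy_test.index).empty
assert len(toy_train) + len(toy_test) == len(toy)

# Compare category proportions to check that both parts look alike.
print(toy_train['Category'].value_counts(normalize=True))
print(toy_test['Category'].value_counts(normalize=True))
```

Because the sampling is purely random rather than stratified, the category proportions in the two parts can differ somewhat, which is worth checking when classes are small.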

In [19]:
train.to_parquet('data/training.parquet')

In [20]:
test.to_parquet('data/testing.parquet')