Write a program that converts the original dataset (http://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/) into a format that fits our NN’s input form. Your program should transform the original dataset’s records (the file agaricus-lepiota.data, provided in the project sub-folder, where each line represents one record with 23 fields: the class label plus 22 features; see agaricus-lepiota.names for details) into one-hot binary code form. The first 2 numbers encode the class (poisonous or edible) and the rest encode the remaining features; each feature may take multiple values, so make sure your binary code can represent every value. See data_representation_example.txt for details. Then split the converted dataset into 3 files: training.txt for network training, val.txt for validation, and testing.txt for network testing. (You may name these files yourself, but make sure to modify the code to match your file names.) Sample_testing.txt is a sample file showing the exact format that we need.

### Import Libraries

In [12]:
import pandas as pd
import numpy as np

### Data

In [3]:
# Read the comma-separated data file. It has no header row, so the column
# names are supplied here; '?' marks a missing value and is read as NaN.
column_names = [
    'classes', 'cap-shape', 'cap-surface', 'cap-color', 'bruises?', 'odor',
    'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color',
    'stalk-shape', 'stalk-root', 'stalk-surface-above-ring',
    'stalk-surface-below-ring', 'stalk-color-above-ring',
    'stalk-color-below-ring', 'veil-type', 'veil-color', 'ring-number',
    'ring-type', 'spore-print-color', 'population', 'habitat',
]
data = pd.read_csv('agaricus-lepiota.data', names=column_names, na_values='?')
data

Unnamed: 0,classes,cap-shape,cap-surface,cap-color,bruises?,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8119,e,k,s,n,f,n,a,c,b,y,...,s,o,o,p,o,o,p,b,c,l
8120,e,x,s,n,f,n,a,c,b,y,...,s,o,o,p,n,o,p,b,v,l
8121,e,f,s,n,f,n,a,c,b,n,...,s,o,o,p,o,o,p,b,c,l
8122,p,k,y,n,f,y,f,c,n,b,...,k,w,w,p,w,o,e,w,v,l


### Exploratory Data Analysis

In [5]:
# Summary of each column: non-null count, number of unique values, most frequent value and its frequency
data.describe().T

Unnamed: 0,count,unique,top,freq
classes,8124,2,e,4208
cap-shape,8124,6,x,3656
cap-surface,8124,4,y,3244
cap-color,8124,10,n,2284
bruises?,8124,2,f,4748
odor,8124,9,n,3528
gill-attachment,8124,2,f,7914
gill-spacing,8124,2,c,6812
gill-size,8124,2,b,5612
gill-color,8124,12,b,1728


### Checking for null values

In [4]:
data.isnull().sum()

classes                        0
cap-shape                      0
cap-surface                    0
cap-color                      0
bruises?                       0
odor                           0
gill-attachment                0
gill-spacing                   0
gill-size                      0
gill-color                     0
stalk-shape                    0
stalk-root                  2480
stalk-surface-above-ring       0
stalk-surface-below-ring       0
stalk-color-above-ring         0
stalk-color-below-ring         0
veil-type                      0
veil-color                     0
ring-number                    0
ring-type                      0
spore-print-color              0
population                     0
habitat                        0
dtype: int64

#### The stalk-root feature has 2480 missing values (recorded as `?` in the raw file), roughly 30% of the 8124 records. Dropping those records or the whole feature would discard a large share of the data and may reduce accuracy, so the missing values are instead replaced with the most frequent stalk-root value.

In [6]:
#checking for which value of stalk-root is most frequent
data['stalk-root'].value_counts()

b    3776
e    1120
c     556
r     192
Name: stalk-root, dtype: int64

In [7]:
# Replacing NaN (originally '?') values with 'b', the most frequent value
data['stalk-root'] = data['stalk-root'].fillna('b')
data['stalk-root'].value_counts()

b    6256
e    1120
c     556
r     192
Name: stalk-root, dtype: int64
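
The cell above hardcodes `'b'` after reading it off the `value_counts()` output. As a minimal sketch, the same most-frequent-value imputation can be written so the fill value is looked up from the data. The `toy` frame and the `most_frequent` and `filled_counts` names are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy column standing in for 'stalk-root'; None marks a missing value.
toy = pd.DataFrame({"stalk-root": ["b", "e", None, "b", "c", None]})

# mode() skips missing values by default and returns a Series of the most
# frequent value(s); take the first entry.
most_frequent = toy["stalk-root"].mode()[0]

# Fill the gaps with that value and recount.
toy["stalk-root"] = toy["stalk-root"].fillna(most_frequent)
filled_counts = toy["stalk-root"].value_counts().to_dict()
```

Looking the value up avoids a stale constant if the input file ever changes, at the cost of making the chosen fill value less visible in the notebook.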

In [8]:
#checking again for any null values
data.isnull().sum()

classes                     0
cap-shape                   0
cap-surface                 0
cap-color                   0
bruises?                    0
odor                        0
gill-attachment             0
gill-spacing                0
gill-size                   0
gill-color                  0
stalk-shape                 0
stalk-root                  0
stalk-surface-above-ring    0
stalk-surface-below-ring    0
stalk-color-above-ring      0
stalk-color-below-ring      0
veil-type                   0
veil-color                  0
ring-number                 0
ring-type                   0
spore-print-color           0
population                  0
habitat                     0
dtype: int64

In [9]:
# Renaming classes p (poisonous) and e (edible) as '0' and '1' respectively
data['classes'] = data['classes'].replace({'p': '0', 'e': '1'})

In [10]:
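# Transforming each categorical variable into a set of binary (0/1) indicator columns.
# dtype=int keeps the output as integers; newer pandas versions default to booleans.
data = pd.get_dummies(data, dtype=int)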

In [11]:
data

Unnamed: 0,classes_0,classes_1,cap-shape_b,cap-shape_c,cap-shape_f,cap-shape_k,cap-shape_s,cap-shape_x,cap-surface_f,cap-surface_g,...,population_s,population_v,population_y,habitat_d,habitat_g,habitat_l,habitat_m,habitat_p,habitat_u,habitat_w
0,1,0,0,0,0,0,0,1,0,0,...,1,0,0,0,0,0,0,0,1,0
1,0,1,0,0,0,0,0,1,0,0,...,0,0,0,0,1,0,0,0,0,0
2,0,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
3,1,0,0,0,0,0,0,1,0,0,...,1,0,0,0,0,0,0,0,1,0
4,0,1,0,0,0,0,0,1,0,0,...,0,0,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8119,0,1,0,0,0,1,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
8120,0,1,0,0,0,0,0,1,0,0,...,0,1,0,0,0,1,0,0,0,0
8121,0,1,0,0,1,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
8122,1,0,0,0,0,1,0,0,0,0,...,0,1,0,0,0,1,0,0,0,0
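To make the encoding step concrete, here is a small sketch on a two-column toy frame (the `toy` and `encoded` names are assumptions for illustration). It also shows one possible sanity check: within the group of columns produced from a single feature, every row should contain exactly one 1.

```python
import pandas as pd

# Two categorical columns standing in for the full 23-column dataset.
toy = pd.DataFrame({
    "classes": ["0", "1", "1"],
    "cap-shape": ["x", "b", "x"],
})

# One indicator column per (feature, value) pair, named '<feature>_<value>'.
encoded = pd.get_dummies(toy, dtype=int)

# Sanity check: each original feature contributes exactly one 1 per row.
for feature in toy.columns:
    group_cols = [c for c in encoded.columns if c.startswith(feature + "_")]
    assert (encoded[group_cols].sum(axis=1) == 1).all()
```

Note that `get_dummies` only creates columns for values that actually occur in the data, so a value listed in agaricus-lepiota.names but absent from the file would get no column.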


### Train-Validation-Test split in an 80:10:10 ratio

In [13]:
# Shuffle all rows, then cut at the 80% and 90% marks
train, validate, test = np.split(data.sample(frac=1, random_state=42), [int(.8*len(data)), int(.9*len(data))])
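
As a sketch of what the split produces, the same shuffle-and-cut can be written with plain `iloc` slicing, which also avoids passing a DataFrame to `np.split` (some recent pandas releases have been reported to emit a deprecation warning for that). The `toy` frame is a stand-in with the dataset's row count, and the `shuffled`, `train_end`, `val_end` and `sizes` names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Stand-in frame with the same number of rows as the mushroom dataset.
toy = pd.DataFrame({"x": np.arange(8124)})

# Shuffle every row once, reproducibly.
shuffled = toy.sample(frac=1, random_state=42)

# Cut points at 80% and 90% of the rows.
n = len(shuffled)
train_end, val_end = int(0.8 * n), int(0.9 * n)

train = shuffled.iloc[:train_end]
validate = shuffled.iloc[train_end:val_end]
test = shuffled.iloc[val_end:]

sizes = (len(train), len(validate), len(test))
```

With 8124 rows, the cut points fall at 6499 and 7311, so the three parts hold 6499, 812 and 813 records.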


### Saving train, validation, and test data into files

In [20]:
train.to_csv('training.txt', index=False, header=False, encoding='utf-8')
validate.to_csv('val.txt', index=False, header=False, encoding='utf-8')
test.to_csv('testing.txt', index=False, header=False, encoding='utf-8')
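
To see what a saved line looks like, the sketch below writes a tiny already-encoded frame to an in-memory buffer with the same `to_csv` options and reads it back. The toy data and the `buffer`, `lines` and `reloaded` names are assumptions; whether this layout matches Sample_testing.txt exactly still has to be checked against that file.

```python
import io
import pandas as pd

# Tiny already-encoded frame: two class columns followed by feature columns.
toy = pd.DataFrame({
    "classes_0": [1, 0],
    "classes_1": [0, 1],
    "cap-shape_b": [0, 1],
    "cap-shape_x": [1, 0],
})

# Same options as the cell above, written to memory instead of a file.
buffer = io.StringIO()
toy.to_csv(buffer, index=False, header=False)
lines = buffer.getvalue().splitlines()

# Reading back with header=None recovers one row per record.
reloaded = pd.read_csv(io.StringIO(buffer.getvalue()), header=None)
```

Each line is a comma-separated row of 0s and 1s with no index and no header, and the first two values are the class code.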
