Import modules needed for assignment

In [1]:
import pandas as pd
import numpy as np
import sklearn.model_selection
import sklearn.linear_model
from sklearn import metrics

Let's recreate the dataframe from our previous assignment.

In [2]:
mushroom_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data',sep=',', header=None, usecols=[0,3,5], names=['Edible', 'Cap Color', 'Odor'])

mushroom_data.replace(to_replace={'Edible':{'p': 1, 'e': 0}}, inplace=True)
mushroom_data.replace(to_replace={'Cap Color':{'n':0, 'b':1, 'c':2, 'g':3, 'r':4, 'p':5, 'u':6, 'e':7, 'w':8, 'y':9}}, inplace=True)
mushroom_data.replace(to_replace={'Odor':{'a':0, 'l':1, 'c':2, 'y':3, 'f':4, 'm':5, 'n':6, 'p':7, 's':8}}, inplace=True)
mushroom_data.head()

   Edible  Cap Color  Odor
0       1          0     7
1       0          9     0
2       0          8     1
3       1          8     7
4       0          3     6


Let's convert odor and cap color into dummy/indicator variables.

In [3]:
odor = pd.Series(mushroom_data['Odor'])
o = pd.get_dummies(odor)

cap_color = pd.Series(mushroom_data['Cap Color'])
c = pd.get_dummies(cap_color)
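To see what get_dummies produces, here is a minimal sketch on a small made-up frame (the values and the odor_/cap_ prefixes are assumptions for illustration, not part of the assignment). The prefix keeps the two sets of indicator columns distinguishable once they are concatenated, since both features use small integer codes.

```python
import pandas as pd

# Hypothetical toy frame standing in for mushroom_data (values are made up).
toy = pd.DataFrame({'Odor': [7, 0, 1, 7, 6], 'Cap Color': [0, 9, 8, 8, 3]})

# One indicator column per distinct code. The prefix is an illustrative
# choice so odor and cap-color columns do not share the same names.
odor_dummies = pd.get_dummies(toy['Odor'], prefix='odor')
cap_dummies = pd.get_dummies(toy['Cap Color'], prefix='cap')

print(list(odor_dummies.columns))
print(list(cap_dummies.columns))
```

Each row has exactly one indicator set per feature, so the row sums of each dummy frame are all 1.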


We should combine the indicator columns and the Edible column into a single dataframe so that we can analyze the data further.

In [4]:
mushroom_col = pd.concat([o, c, mushroom_data['Edible']], axis=1)
cols = list(mushroom_col.iloc[:, :-1])

Now we'll split the data into training and test sets to take a deeper look.

In [5]:
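# mushroom_data columns are Edible (0), Cap Color (1), Odor (2).
# X takes every column except the last, i.e. Edible and Cap Color.
X = mushroom_data.iloc[:, :-1].values
# Y takes column 1, i.e. Cap Color (which is also one of the columns in X).
Y = mushroom_data.iloc[:, 1].values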

X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X, Y, random_state=1)
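Selecting columns by name rather than by position makes it easier to check that the target is not among the features. A minimal sketch on a made-up frame with the same column names (the random values and the feature_cols name are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in with the same column names as mushroom_data.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Edible': rng.integers(0, 2, size=40),
    'Cap Color': rng.integers(0, 10, size=40),
    'Odor': rng.integers(0, 9, size=40),
})

# Features and target are picked by name, so it is obvious that
# 'Edible' is the target and does not appear in X.
feature_cols = ['Cap Color', 'Odor']
X = toy[feature_cols].values
Y = toy['Edible'].values

# With the default test_size, a quarter of the rows are held out.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=1)
print(X_train.shape, X_test.shape)
```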

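We'll now fit a linear regression on the training set and predict Y for the test set. We'll also try out scikit-learn's error metrics on a small pair of true (t) and predicted (p) lists; since the two lists are identical, every metric should be zero: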

In [6]:
l_reg = sklearn.linear_model.LinearRegression()
l_reg.fit(X_train, Y_train)
Y_pred = l_reg.predict(X_test)
t = [1, 0]
p = [1, 0]

print(sklearn.metrics.mean_absolute_error(t, p))
print(sklearn.metrics.mean_squared_error(t, p))
print(np.sqrt(sklearn.metrics.mean_squared_error(t, p)))

0.0
0.0
0.0
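For reference, the three metrics can be written out by hand: mean absolute error averages |true - predicted|, mean squared error averages the squared differences, and root mean squared error is the square root of that. A minimal sketch with made-up values (y_true and y_pred here are illustrative, not taken from the mushroom data) that checks the hand-written versions against scikit-learn:

```python
import numpy as np
from sklearn import metrics

# Made-up true and predicted values, chosen so the errors are not all zero.
y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.5, 0.0, 1.0, 0.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root mean squared error

print(mae, metrics.mean_absolute_error(y_true, y_pred))
print(mse, metrics.mean_squared_error(y_true, y_pred))
print(rmse)
```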




Next we'll calculate the root mean squared error (RMSE) on the test set to measure how far the predictions are from the actual values.

In [7]:
print(np.sqrt(metrics.mean_squared_error(Y_test, Y_pred)))


1.6820267025171296e-14
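A value this close to zero is worth a sanity check, because it can also arise when the target column is itself among the features, as in In [5], where X includes the Cap Color column and Y is Cap Color. A minimal sketch on made-up data (not the mushroom set; all names here are hypothetical) of how that looks:

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression

# Made-up columns: 'target' plays the role of the column being predicted,
# 'other' is an unrelated feature.
rng = np.random.default_rng(1)
target = rng.integers(0, 10, size=200).astype(float)
other = rng.integers(0, 2, size=200).astype(float)

# Leaky setup: the target is one of the feature columns, so the model
# can reproduce it exactly by giving that column a weight of 1.
X_leaky = np.column_stack([other, target])
leaky_model = LinearRegression().fit(X_leaky, target)
rmse_leaky = np.sqrt(metrics.mean_squared_error(target, leaky_model.predict(X_leaky)))

# Clean setup: the target is kept out of the features.
X_clean = other.reshape(-1, 1)
clean_model = LinearRegression().fit(X_clean, target)
rmse_clean = np.sqrt(metrics.mean_squared_error(target, clean_model.predict(X_clean)))

print('leaky RMSE:', rmse_leaky)
print('clean RMSE:', rmse_clean)
```

The leaky RMSE is at floating-point noise level, while the clean one reflects how little the unrelated feature explains.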


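The root mean squared error is essentially zero, but this does not show that edibility can be predicted accurately: in the split above, Y is the Cap Color column, and that same column is one of the two features in X, so the model only has to copy it. Let's try the odor indicators next and see how they fare.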

In [8]:
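# mushroom_col columns: 0-8 odor indicators, 9-18 cap-color indicators, 19 Edible.
# X takes the odor indicators for codes 0, 1 and 2 (almond, anise, creosote).
X = mushroom_col.iloc[:, 0:3].values
# Y takes column 1, the anise odor indicator, which is also a column of X.
Y = mushroom_col.iloc[:, 1].values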

X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X, Y, random_state=1)
l_reg.fit(X_train, Y_train)
Y_pred = l_reg.predict(X_test)

print(np.sqrt(metrics.mean_squared_error(Y_test, Y_pred)))

8.982975992785984e-15


Now let's check the cap color to see if the root mean squared error is closer to zero than it was for odor.

In [9]:
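# Columns 5-8 are odor indicators (codes 5-8); column 9 is the first cap-color indicator.
X = mushroom_col.iloc[:, 5:10].values
# Y is still column 1, the anise odor indicator, not Edible (the last column).
Y = mushroom_col.iloc[:, 1].values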

X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X, Y, random_state=1)
l_reg.fit(X_train, Y_train)
Y_pred = l_reg.predict(X_test)

print(np.sqrt(metrics.mean_squared_error(Y_test, Y_pred)))

0.20711573408146186


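In terms of root mean squared error, the odor run comes closest to zero. Note, however, that both runs use column 1 of mushroom_col (the anise odor indicator) as Y rather than Edible, that the odor run includes that column in X, and that the slice 5:10 covers four odor indicators and only one cap-color indicator. These numbers therefore do not yet show which feature best determines whether a mushroom is edible. For that comparison, Y should be the Edible column (the last one), with columns 0:9 as the odor features and columns 9:19 as the cap-color features.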
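One way to set up a like-for-like comparison is a small helper that always predicts the same target and only swaps the feature group. This is a sketch under assumptions: rmse_for is a hypothetical helper, and the toy data are made up so that the label is derived from the odor code on purpose. The printed numbers say nothing about the real mushroom data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def rmse_for(features, target, random_state=1):
    """Fit a linear regression on a train split and return the test-set RMSE."""
    X_train, X_test, y_train, y_test = train_test_split(
        features, target, random_state=random_state)
    model = LinearRegression().fit(X_train, y_train)
    return np.sqrt(mean_squared_error(y_test, model.predict(X_test)))


# Made-up data: the label is constructed from the odor code on purpose,
# and cap color is independent noise. Not a result on the real data set.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Odor': rng.integers(0, 4, size=200),
    'Cap Color': rng.integers(0, 10, size=200),
})
toy['Edible'] = (toy['Odor'] >= 2).astype(int)

# Each feature group is one-hot encoded separately; the target is the same.
odor_X = pd.get_dummies(toy['Odor'], prefix='odor').astype(float).values
cap_X = pd.get_dummies(toy['Cap Color'], prefix='cap').astype(float).values
y = toy['Edible'].values

rmse_odor = rmse_for(odor_X, y)
rmse_cap = rmse_for(cap_X, y)
print('odor RMSE:', rmse_odor)
print('cap color RMSE:', rmse_cap)
```

On the real data, the same helper could be called with mushroom_col.iloc[:, 0:9].values and mushroom_col.iloc[:, 9:19].values as features and mushroom_col['Edible'].values as the target.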