# Working with Mushrooms
Pattern recognition project
- Ronnel Davis
- Ashiq Muhammed
- Vrushabh Jambhulkar
- John V George
- Vipin Santosh

Dataset: https://archive.ics.uci.edu/ml/datasets/Mushroom

## Proposal 
Surviving in the wild can involve anything from starting a fire with a couple of sticks to eating wild edibles. In wilderness regions, finding wild edibles such as mushrooms is often necessary in order to stay alive. The inherent risks of eating the inedible variety are great, with effects ranging from “a mild stomachache to severe physical distress-including vomiting, diarrhea, cramps and loss of coordination.” In the field of botany, no general rule exists for the identification of these dangerous and even life-threatening fungi.

The aim of this project is to determine whether such a rule can be found using machine learning techniques. We use existing data from the University of California, Irvine (UCI) Machine Learning Repository to see what kinds of rules for avoiding potentially poisonous mushrooms can be learned directly from the recorded features of the fungi. The field guide the dataset is drawn from likewise states that no simple classification rule exists. The ultimate goal is to determine the most common features of toxic and edible wild mushrooms.

### Attribute Information: (classes: edible=e, poisonous=p)
    1. cap-shape:                bell=b,conical=c,convex=x,flat=f,
                                  knobbed=k,sunken=s
    2. cap-surface:              fibrous=f,grooves=g,scaly=y,smooth=s
    3. cap-color:                brown=n,buff=b,cinnamon=c,gray=g,green=r,
                                  pink=p,purple=u,red=e,white=w,yellow=y
    4. bruises?:                 bruises=t,no=f
    5. odor:                     almond=a,anise=l,creosote=c,fishy=y,foul=f,
                                  musty=m,none=n,pungent=p,spicy=s
    6. gill-attachment:          attached=a,descending=d,free=f,notched=n
    7. gill-spacing:             close=c,crowded=w,distant=d
    8. gill-size:                broad=b,narrow=n
    9. gill-color:               black=k,brown=n,buff=b,chocolate=h,gray=g,
                                  green=r,orange=o,pink=p,purple=u,red=e,
                                  white=w,yellow=y
    10. stalk-shape:              enlarging=e,tapering=t
    11. stalk-root:               bulbous=b,club=c,cup=u,equal=e,
                                  rhizomorphs=z,rooted=r,missing=?
    12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s
    13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s
    14. stalk-color-above-ring:   brown=n,buff=b,cinnamon=c,gray=g,orange=o,
                                  pink=p,red=e,white=w,yellow=y
    15. stalk-color-below-ring:   brown=n,buff=b,cinnamon=c,gray=g,orange=o,
                                  pink=p,red=e,white=w,yellow=y
    16. veil-type:                partial=p,universal=u
    17. veil-color:               brown=n,orange=o,white=w,yellow=y
    18. ring-number:              none=n,one=o,two=t
    19. ring-type:                cobwebby=c,evanescent=e,flaring=f,large=l,
                                  none=n,pendant=p,sheathing=s,zone=z
    20. spore-print-color:        black=k,brown=n,buff=b,chocolate=h,green=r,
                                  orange=o,purple=u,white=w,yellow=y
    21. population:               abundant=a,clustered=c,numerous=n,
                                  scattered=s,several=v,solitary=y
    22. habitat:                  grasses=g,leaves=l,meadows=m,paths=p,
                                  urban=u,waste=w,woods=d
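
Every attribute in the list above is nominal: the letter codes name categories and carry no order. One common way to turn such a value into numbers without implying an order is one-hot encoding. The block below is a minimal sketch for a single attribute, using the `cap-surface` codes from the list; the names `CAP_SURFACE_CODES` and `one_hot` are hypothetical and introduced only for illustration.

```python
# Letter codes for cap-surface, in the order given in the attribute list:
# fibrous=f, grooves=g, scaly=y, smooth=s
CAP_SURFACE_CODES = ["f", "g", "y", "s"]


def one_hot(symbol, codes):
    """Return a 0/1 list with a single 1 at the position of `symbol` in `codes`."""
    if symbol not in codes:
        raise ValueError(f"unknown code: {symbol!r}")
    return [1 if symbol == code else 0 for code in codes]


print(one_hot("y", CAP_SURFACE_CODES))  # [0, 0, 1, 0]
```

Each category gets its own column, so no category is treated as "larger" than another.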

In [17]:
classes                   = {"e": 0, "p": 1}

cap_shape                 = {"b": 0, "c": 1, "x": 2, "f": 3, "k": 4, "s": 5}
cap_surface               = {"f": 0, "g": 1, "y": 2, "s": 3}
cap_color                 = {"n": 0, "b": 1, "c": 2, "g": 3, "r": 4, "p": 5, "u": 6, "e": 7, "w": 8, "y": 9}
cap_bruise                = {"f": 0, "t": 1}

odor                      = {"a": 0, "l": 1, "c": 2, "y": 3, "f": 4, "m": 5, "n": 6, "p": 7, "s": 8}

gill_attachment           = {"a": 0, "d": 1, "f": 2, "n": 3}
gill_spacing              = {"c": 0, "w": 1, "d": 2}
gill_size                 = {"b": 0, "n": 1}
gill_color                = {"k": 0, "n": 1, "b": 2, "h": 3, "g": 4, "r": 5, "o": 6, "p": 7, "u": 8, "e": 9, "w": 10, "y": 11}

stalk_shape               = {"e": 0, "t": 1}
stalk_surface_above_ring  = {"f": 0, "y": 1, "k": 2, "s": 3}
stalk_surface_below_ring  = {"f": 0, "y": 1, "k": 2, "s": 3}
stalk_color_above_ring    = {"n": 0, "b": 1, "c": 2, "g": 3, "o": 4, "p": 5, "e": 6, "w": 7, "y": 8}
stalk_color_below_ring    = {"n": 0, "b": 1, "c": 2, "g": 3, "o": 4, "p": 5, "e": 6, "w": 7, "y": 8}

veil_type                 = {"p": 0, "u": 1}
veil_color                = {"n": 0, "o": 1, "w": 2, "y": 3}

ring_number               = {"n": 0, "o": 1, "t": 2}
ring_type                 = {"c": 0, "e": 1, "f": 2, "l": 3, "n": 4, "p": 5, "s": 6, "z": 7}

spore_print_color         = {"k": 0, "n": 1, "b": 2, "h": 3, "r": 4, "o": 5, "u": 6, "w": 7, "y": 8}

population                = {"a": 0, "c": 1, "n": 2, "s": 3, "v": 4, "y": 5}

habitat                   = {"g": 0, "l": 1, "m": 2, "p": 3, "u": 4, "w": 5, "d": 6}

## The following code reads the file, gets the data as a string and splits it into individual vectors. Since the data is encoded as characters, it is necessary to convert it into integers before working with it.

In [18]:
import numpy as np

def getVectorRepresentationOf(data):
    # data[0] is the class label; data[1] to data[22] are the attributes in the order listed above.
    # stalk-root (data[11]) is not used; it is the attribute that contains missing values ("?").
    # The result is therefore a vector of 21 integer-coded features.
    x = [cap_shape[data[1]], cap_surface[data[2]], cap_color[data[3]], cap_bruise[data[4]],
         odor[data[5]], gill_attachment[data[6]], gill_spacing[data[7]], gill_size[data[8]], gill_color[data[9]],
         stalk_shape[data[10]], stalk_surface_above_ring[data[12]], stalk_surface_below_ring[data[13]],
         stalk_color_above_ring[data[14]], stalk_color_below_ring[data[15]],
         veil_type[data[16]], veil_color[data[17]], ring_number[data[18]], ring_type[data[19]],
         spore_print_color[data[20]], population[data[21]], habitat[data[22]]]
    y = classes[data[0]]
    return x, y

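# Read the whole file and split it into one string per record.
with open('data.txt', 'r') as myfile:
    data = myfile.read().split("\n")

inputs = []
outputs = []
for line in data:
    arr = line.split(",")
    # Skip blank or malformed lines: a valid record has the class plus 22 attributes.
    if len(arr) == 23:
        x, y = getVectorRepresentationOf(arr)
        inputs.append(x)
        outputs.append(y)

# Convert the lists to NumPy arrays for use with scikit-learn.
inputs = np.asarray(inputs)
outputs = np.asarray(outputs)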
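
The lookup tables used above are written out by hand. As a hedged alternative, the same kind of character-to-integer table can be derived from the data itself. The sketch below uses only the standard library; `build_code_tables` and `encode` are hypothetical helper names, and the three short records are made up for illustration rather than taken from the dataset. The integers it assigns follow sorted symbol order, so they will generally differ from the hand-written tables.

```python
def build_code_tables(rows):
    """Return one {symbol: integer} table per column, numbering symbols in sorted order."""
    columns = zip(*rows)  # transpose: one tuple of symbols per column
    return [{sym: i for i, sym in enumerate(sorted(set(col)))} for col in columns]


def encode(rows, tables):
    """Replace every symbol by its integer code from the table of its column."""
    return [[tables[j][sym] for j, sym in enumerate(row)] for row in rows]


# Made-up records in the same comma-separated layout (class label first).
raw = ["p,x,s,n", "e,x,y,y", "e,b,s,w"]
rows = [line.split(",") for line in raw]

tables = build_code_tables(rows)
encoded = encode(rows, tables)

print(tables[0])  # {'e': 0, 'p': 1}
print(encoded)    # [[1, 1, 0, 0], [0, 1, 1, 2], [0, 0, 0, 1]]
```

Deriving the tables avoids copy-and-paste slips between similar attributes, at the cost of codes that depend on which symbols actually occur in the file.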

## The following code projects the input onto 1 to 20 principal components, splits each projection into training and test datasets, and measures the test accuracy of a Gaussian naive Bayes classifier

In [19]:
from sklearn.decomposition import PCA as sklearnPCA
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB

train_proportion = 0.8

for i in range(1, 21):
    # Project all samples onto the first i principal components.
    # Note: PCA is fitted on the full dataset, before the train/test split.
    sklearn_pca = sklearnPCA(n_components=i)
    inputs_PCA = sklearn_pca.fit_transform(inputs)

    # First 80% of the rows (in file order, no shuffling) for training, the rest for testing.
    train_test_cut = int(len(inputs_PCA) * train_proportion)
    inputs_train, inputs_test = inputs_PCA[:train_test_cut], inputs_PCA[train_test_cut:]
    outputs_train, outputs_test = outputs[:train_test_cut], outputs[train_test_cut:]

    # Fit Gaussian naive Bayes and report the fraction of correct test predictions.
    clf = GaussianNB()
    clf.fit(inputs_train, outputs_train)
    target_pred = clf.predict(inputs_test)
    accuracy = accuracy_score(outputs_test, target_pred, normalize=True)
    print("Using ", i, " components, accuracy is ", accuracy*100, "%")

Using  1  components, accuracy is  82.52307692307693 %
Using  2  components, accuracy is  68.9846153846154 %
Using  3  components, accuracy is  88.24615384615385 %
Using  4  components, accuracy is  91.87692307692308 %
Using  5  components, accuracy is  87.63076923076923 %
Using  6  components, accuracy is  78.58461538461539 %
Using  7  components, accuracy is  73.29230769230769 %
Using  8  components, accuracy is  73.6 %
Using  9  components, accuracy is  82.83076923076923 %
Using  10  components, accuracy is  86.64615384615385 %
Using  11  components, accuracy is  83.26153846153846 %
Using  12  components, accuracy is  83.6923076923077 %
Using  13  components, accuracy is  84.3076923076923 %
Using  14  components, accuracy is  75.01538461538462 %
Using  15  components, accuracy is  76.24615384615385 %
Using  16  components, accuracy is  82.39999999999999 %
Using  17  components, accuracy is  82.95384615384616 %
Using  18  components, accuracy is  88.73846153846155 %
Using  19  compon
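
The split used in the cell above takes the first 80% of the rows in file order. The following is a minimal sketch of that cut, together with a shuffled variant, on toy data. `split_train_test` and its `seed` parameter are hypothetical names introduced for illustration, and the toy labels are deliberately sorted by class to show what an unshuffled cut can do in the worst case. Whether row order actually matters for the mushroom file is not established here.

```python
import random


def split_train_test(inputs, outputs, train_proportion=0.8, seed=None):
    """Split paired sequences into train and test parts.

    With seed=None the rows keep their original order (as in the cell above);
    with an integer seed the rows are shuffled reproducibly before the cut.
    """
    indices = list(range(len(inputs)))
    if seed is not None:
        random.Random(seed).shuffle(indices)
    cut = int(len(indices) * train_proportion)
    train_idx, test_idx = indices[:cut], indices[cut:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(inputs, train_idx), pick(inputs, test_idx),
            pick(outputs, train_idx), pick(outputs, test_idx))


# Toy data: ten samples whose labels are sorted by class.
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2

_, _, y_train, y_test = split_train_test(X, y)
print(y_train)  # [0, 0, 0, 0, 0, 0, 0, 0]
print(y_test)   # [1, 1]

# Shuffled variant: same sizes, but rows are drawn from the whole file.
X_tr, X_te, y_tr, y_te = split_train_test(X, y, seed=0)
print(len(X_tr), len(X_te))  # 8 2
```

On the sorted toy labels the unshuffled test set contains a single class, which is the situation a shuffled split is meant to rule out.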

## Conclusion
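As the output shows, accuracy does not improve steadily as components are added. Among the results shown, it is highest (about 91.9%) when the data is projected onto the first four principal components. These are the four directions of highest variance in the integer-encoded data; each is a linear combination of the attributes, not a single original attribute.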
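
The peak can be checked directly from the printed accuracies. The sketch below copies the values for 1 to 18 components from the output above, rounded to two decimals; the remaining output is cut off and is not guessed at. The variable names are introduced here for illustration.

```python
# Test accuracy (%) by number of principal components, copied from the printed
# output above and rounded to two decimals. Only 1 to 18 components are available.
accuracy_by_components = {
    1: 82.52, 2: 68.98, 3: 88.25, 4: 91.88, 5: 87.63, 6: 78.58,
    7: 73.29, 8: 73.60, 9: 82.83, 10: 86.65, 11: 83.26, 12: 83.69,
    13: 84.31, 14: 75.02, 15: 76.25, 16: 82.40, 17: 82.95, 18: 88.74,
}

# Number of components with the highest and lowest accuracy.
best = max(accuracy_by_components, key=accuracy_by_components.get)
worst = min(accuracy_by_components, key=accuracy_by_components.get)

# Gap between the best and worst setting, in percentage points.
spread = round(accuracy_by_components[best] - accuracy_by_components[worst], 2)

print(best, accuracy_by_components[best])    # 4 91.88
print(worst, accuracy_by_components[worst])  # 2 68.98
print(spread)
```

The gap of roughly 23 percentage points between the best and worst setting shows how sensitive this classifier is to the number of components kept.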