Take all 8123 samples from mushrooms.csv, which contains both edible and poisonous mushrooms. Build a hierarchical clustering with the Orange3 software and draw its dendrogram. If we select all poisonous samples in the dendrogram, we can see that they sit together in the same groups.
![Image of dendrogram](https://raw.githubusercontent.com/kopylovvlad/mushroom_classification/master/h_clust.png)

Now we can outline a simple classifier. Split all samples into train and test subsets and transform each sample into a vector. Take one sample from the test subset and find its 3 nearest vectors in the train subset. If at least 2 of the 3 nearest vectors have the 'edible' class, we assume that the test sample is edible too. If at least 2 of them have the 'poisonous' class, we assume that the test sample is poisonous.
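
To make the voting rule concrete, here is a minimal sketch on toy data. It is not the code used in this article: the `classify` function, the toy vectors and the `mismatch_distance` metric (a simple count of differing positions) are all assumptions chosen for illustration only.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def mismatch_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Count positions where two equal-length vectors differ (stand-in metric)."""
    return sum(1 for x, y in zip(a, b) if x != y)


def classify(sample: Sequence[int],
             train: List[Tuple[Sequence[int], str]],
             k: int = 3) -> str:
    """Return the majority class among the k train vectors nearest to sample."""
    nearest = sorted(
        train, key=lambda item: mismatch_distance(sample, item[0])
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy train subset: (vector, class), 'e' = edible, 'p' = poisonous.
train = [
    ([1, 0, 0, 1], 'e'),
    ([1, 0, 1, 1], 'e'),
    ([0, 1, 1, 0], 'p'),
    ([0, 1, 0, 0], 'p'),
    ([1, 1, 0, 1], 'e'),
]

print(classify([1, 0, 0, 1], train))  # neighbours vote e, e, e
print(classify([0, 1, 1, 0], train))  # two of three neighbours are p
```

With two classes and three neighbours the vote can never end in a tie, which is why an odd number of neighbours is convenient here.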

Implementation of the first step:
* Open the CSV file
* Convert the raw CSV data to a dataset of vectors
* Take the train subset (for example, 100 samples)
* Save the train dataset to a file

The code in Python 3.6 is below:


In [None]:
from typing import List, Dict, Tuple
import clusters
import csv_helper as csv_h
import os
import argparse
import pickle

parser = argparse.ArgumentParser()
parser.add_argument("-l", '--limit', help='Csv row limit', type=int)
arguments = parser.parse_args()
csv_row_limit: int = arguments.limit or 100  # 8120

print('csv_limit is: %d' % csv_row_limit)

dirname: str = os.path.dirname(os.path.abspath(__file__))
item_names, props, data = csv_h.csv_to_vector(dirname + '/mushrooms.csv')

data = data[:csv_row_limit]
item_names = item_names[:csv_row_limit]
print('We have %d train items' % len(data))

# Save the train subset; the context manager closes the file even on errors.
with open(dirname + '/tmp_data/mushrooms_data_vector1.pickle', 'wb') as f:
    pickle.dump((item_names, props, data), f)
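
The script above calls `csv_h.csv_to_vector`, whose code is not shown in this article. As a rough idea of what such a helper could do, here is a minimal sketch. It assumes that the first CSV column holds the class (`e` or `p`), that every other column is categorical, and that each column/value pair becomes one 0/1 slot. The function name `csv_to_vector_sketch` and the `<row number>_<class>` naming scheme are hypothetical; the only detail taken from the scripts in this article is that the class letter is the last character of a sample name.

```python
import csv
import io
from typing import Iterable, List, Tuple


def csv_to_vector_sketch(
    lines: Iterable[str]
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Hypothetical stand-in for csv_h.csv_to_vector (one-hot encoding)."""
    rows = list(csv.reader(lines))
    header, body = rows[0], rows[1:]

    # One property per distinct (column, value) pair, in a stable order.
    props = sorted({'%s=%s' % (header[c], row[c])
                    for row in body for c in range(1, len(header))})
    index = {prop: i for i, prop in enumerate(props)}

    item_names: List[str] = []
    data: List[List[int]] = []
    for n, row in enumerate(body):
        # Keep the class letter as the last character of the name.
        item_names.append('%d_%s' % (n, row[0]))
        vector = [0] * len(props)
        for c in range(1, len(header)):
            vector[index['%s=%s' % (header[c], row[c])]] = 1
        data.append(vector)
    return item_names, props, data


# Tiny in-memory CSV instead of the real mushrooms.csv.
sample = io.StringIO('class,cap-shape,odor\np,x,p\ne,b,a\ne,x,a\n')
names, props, data = csv_to_vector_sketch(sample)
print(names)
print(props)
print(data)
```

Binary vectors of this kind work well with set-based distances, because every slot simply says whether a sample has a given property or not.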


Implementation of the second step:
* Open the CSV file
* Convert the raw CSV data to a dataset of vectors
* Take the test subset (for example, another 100 samples)
* Pick samples one by one from the test subset and find the 3 nearest vectors in the train subset
* Check whether the predicted class equals the real class (edible or poisonous)

The code in Python 3.6 is below:

In [None]:
from typing import List, Dict
import clusters
import os
import csv
import argparse
import csv_helper as csv_h
import pickle
import sys

#
# prepare data
#

dirname: str = os.path.dirname(os.path.abspath(__file__))
parser = argparse.ArgumentParser()
parser.add_argument("-l", '--limit', help='Csv row limit', type=int)
parser.add_argument("-o", '--offset', help='Offset limit', type=int)
arguments = parser.parse_args()
csv_row_limit: int = arguments.limit or 100  # 8120
# `or 100` would silently turn an explicit `--offset 0` into 100
offset: int = 100 if arguments.offset is None else arguments.offset
print('csv_limit is: %d' % csv_row_limit)
print('offset is: %d' % offset)


#
# prepare csv
#

print('Opening csv ... ', end='')
file_path: str = dirname + '/mushrooms.csv'
test_item_names: List[str]
_props: List[str]
test_data: List[List[int]]
test_item_names, _props, test_data = csv_h.csv_to_vector(file_path)
del _props

test_item_names = test_item_names[offset:][:csv_row_limit]
test_data = test_data[offset:][:csv_row_limit]
print('end')

print('We have %d test_items' % len(test_data))

#
# pickling
#

print('Unpickling ... ', end='')
with open(dirname + '/tmp_data/mushrooms_data_vector1.pickle', 'rb') as f:
    know_item_names, know_props, know_data = pickle.load(f)
print('end')

print('Train data items: %d' % len(know_data))


def p_e_verict(three_item: List[str]) -> str:
    """Majority vote; the class letter is the last character of each name."""
    p_size: int = 0
    e_size: int = 0
    for name in three_item:
        word: str = name[-1]
        if word == 'p':
            p_size += 1
        elif word == 'e':
            e_size += 1
        else:
            raise ValueError('class %r is not one of [p, e]' % word)

    # With three neighbours and two classes there is never a tie.
    return 'p' if p_size > e_size else 'e'


#
# processing
#
cassify_data: List[str] = []
for test_row in test_data:
    # Names of the three train samples closest to this test vector
    three_closest_name: List[str] = clusters.get_three_closest_names(
        test_row,
        know_item_names,
        know_data,
        distance=clusters.tanimoto_coeff
    )
    cassify_data.append(p_e_verict(three_closest_name))


#
# checking
#
stat: Dict[str, int] = {
    'equal': 0,
    'not_equal': 0
}
print('Checking ... ', end='')
for i in range(len(test_data)):
    # The real class is the last character of the sample name
    test_name = test_item_names[i][-1]
    predict_name: str = cassify_data[i]

    if predict_name == test_name:
        stat['equal'] += 1
    else:
        stat['not_equal'] += 1
print('')

print('Equal is: %d' % stat['equal'])
print('Not equal is: %d' % stat['not_equal'])
total: int = stat['equal'] + stat['not_equal']

if total == 0:
    print('Accuracy is 0')
else:
    # Share of correct predictions, in percent
    print('Accuracy is %f' % (100 * stat['equal'] / total))
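
The script relies on `clusters.get_three_closest_names` and `clusters.tanimoto_coeff`, which are not listed in this article. The sketch below shows one way such helpers could look. It assumes 0/1 vectors and defines the Tanimoto distance as 1 minus the number of shared properties divided by the number of properties present in either vector. The names `tanimoto_distance` and `three_closest_names` are hypothetical, and the real module may differ.

```python
import heapq
from typing import Callable, List, Sequence


def tanimoto_distance(v1: Sequence[int], v2: Sequence[int]) -> float:
    """Return 1 - shared / union for 0/1 vectors; 0.0 means identical sets."""
    c1 = sum(1 for x in v1 if x)
    c2 = sum(1 for x in v2 if x)
    shared = sum(1 for x, y in zip(v1, v2) if x and y)
    union = c1 + c2 - shared
    if union == 0:
        return 0.0  # two all-zero vectors are treated as identical
    return 1.0 - shared / union


def three_closest_names(
    row: Sequence[int],
    names: List[str],
    data: List[Sequence[int]],
    distance: Callable[[Sequence[int], Sequence[int]], float] = tanimoto_distance,
) -> List[str]:
    """Return the names of the three vectors in data nearest to row."""
    pairs = heapq.nsmallest(
        3, zip(data, names), key=lambda pair: distance(row, pair[0])
    )
    return [name for _, name in pairs]


# Toy train subset; the class letter is the last character of each name.
names = ['0_e', '1_e', '2_p', '3_p']
data = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 1, 1, 1]]
print(three_closest_names([1, 1, 0, 0], names, data))
```

`heapq.nsmallest` avoids sorting the whole train subset for every test sample, which matters once the train subset grows to a few thousand vectors.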


Let's run the scripts. For the train subset I chose 3500 samples. For the test subset I experimented with three sizes (1000, 3500 and 4600 samples).

The results are:
* For 3500 train samples and 1000 test samples (train-to-test ratio 7:2), accuracy is **97.9%**
* For 3500 train samples and 3500 test samples (train-to-test ratio 1:1), accuracy is **77.6%**
* For 3500 train samples and 4600 test samples (train-to-test ratio 7:9.2), accuracy is **71.5%**

It works. This is my first experiment written from scratch in Python 3.6. The source code is available on GitHub.

GitHub link to the [source code](https://github.com/kopylovvlad/mushroom_classification)