<h1><center>Association Rules and Classification: Mushrooms Dataset</center></h1>

<center><img src="https://www.uncovercolorado.com/wp-content/uploads/2020/04/Amanita-Muscaria-CO-1600x800-1.jpg"></center>

# Introduction
A mushroom or toadstool is the fleshy, spore-bearing fruiting body of a fungus, typically produced above ground, on soil, or on its food source.

The standard for the name "mushroom" is the cultivated white button mushroom, Agaricus bisporus; hence the word "mushroom" is most often applied to those fungi (Basidiomycota, Agaricomycetes) that have a stem (stipe), a cap (pileus), and gills (lamellae, sing. lamella) on the underside of the cap. "Mushroom" also describes a variety of other gilled fungi, with or without stems, and the term is likewise used for the fleshy fruiting bodies of some Ascomycota. The gills produce microscopic spores that help the fungus spread across the ground or the surface it occupies.

Forms deviating from the standard morphology usually have more specific names, such as "bolete", "puffball", "stinkhorn", and "morel", and gilled mushrooms themselves are often called "agarics" in reference to their similarity to Agaricus or their order Agaricales. By extension, the term "mushroom" can also refer to either the entire fungus when in culture, the thallus (called a mycelium) of species forming the fruiting bodies called mushrooms, or the species itself.

## Context
Although this dataset was originally contributed to the UCI Machine Learning repository nearly 30 years ago, mushroom hunting (otherwise known as "shrooming") is enjoying new peaks in popularity. Learn which features spell certain death and which are most palatable in this dataset of mushroom characteristics. And how certain can your model be?

## Content
This dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family, drawn from The Audubon Society Field Guide to North American Mushrooms (1981). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This last class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like "leaflets three, let it be" for poison oak and poison ivy.

* Time period: Donated to UCI ML 27 April 1987

## Inspiration
* What types of machine learning models perform best on this dataset?

* Which features are most indicative of a poisonous mushroom?

## Acknowledgements
This dataset was originally donated to the UCI Machine Learning repository. You can learn more about past research using the data here.

## About this file

* Attribute Information: (classes: edible=e, poisonous=p)

* cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s

* cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s

* cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y

* bruises: bruises=t,no=f

* odor: almond=a,anise=l,creosote=c,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s

* gill-attachment: attached=a,descending=d,free=f,notched=n

* gill-spacing: close=c,crowded=w,distant=d

* gill-size: broad=b,narrow=n

* gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e,white=w,yellow=y

* stalk-shape: enlarging=e,tapering=t

* stalk-root: bulbous=b,club=c,cup=u,equal=e,rhizomorphs=z,rooted=r,missing=?

* stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s

* stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s

* stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y

* stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y

* veil-type: partial=p,universal=u

* veil-color: brown=n,orange=o,white=w,yellow=y

* ring-number: none=n,one=o,two=t

* ring-type: cobwebby=c,evanescent=e,flaring=f,large=l,none=n,pendant=p,sheathing=s,zone=z

* spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r,orange=o,purple=u,white=w,yellow=y

* population: abundant=a,clustered=c,numerous=n,scattered=s,several=v,solitary=y

* habitat: grasses=g,leaves=l,meadows=m,paths=p,urban=u,waste=w,woods=d
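
The single-letter codes above are compact but hard to read in plots and tables. As a minimal sketch, the lookup below copies two of the attributes from the list into a dictionary and decodes individual codes; the names `ATTRIBUTE_CODES` and `decode` are illustrative and not part of the dataset or any library.

```python
# Code-to-label lookup copied from the attribute list above (two attributes shown).
ATTRIBUTE_CODES = {
    "class": {"e": "edible", "p": "poisonous"},
    "odor": {
        "a": "almond", "l": "anise", "c": "creosote", "y": "fishy", "f": "foul",
        "m": "musty", "n": "none", "p": "pungent", "s": "spicy",
    },
}


def decode(attribute, code):
    """Return the readable label for a code, or the code itself if it is unknown."""
    return ATTRIBUTE_CODES.get(attribute, {}).get(code, code)


print(decode("odor", "f"))   # foul
print(decode("class", "p"))  # poisonous
```

With pandas, the same dictionary can be applied to a whole column, for example `df["odor"].map(ATTRIBUTE_CODES["odor"])`.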

## Contents of notebook

1. [Import data and python packages](#t1.)
    * Import packages
    * Import data
    * Data shape and info
2. [Data visualization and Analysis](#t2.)
    * Pie Chart
    * Count plots
    * Query Charts
3. [Classification](#t3.)

    3.1 [Split data for train and test](#t3.1)
    
    3.2 [Functions for models](#t3.2)
    
    3.3 [Models](#t3.3)
      * MLP Classifier
      * Random Forest Classifier
      * Gradient Boosting Classifier
      * XGB Classifier
      * Cross Validation Score
4. [Association Rules](#t4.)

      4.1 [Apriori Algorithm](#t4.1)

<a id="t1."></a>
# 1. Import data and python packages

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

from xgboost import XGBClassifier,XGBRFClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

In [None]:
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

In [None]:
sns.set_style('whitegrid')

In [None]:
df = pd.read_csv('/kaggle/input/mushroom-classification/mushrooms.csv')
# Assign the result back; an in-place replace on a selected column may not update df.
df['class'] = df['class'].replace({'e': 'Edible', 'p': 'Poisonous'})
df.head()

In [None]:
df.shape

In [None]:
df.info()

<a id="t2."></a>
# 2. Data visualization and Analysis

In [None]:
labels = df['class'].value_counts().index
sizes = df['class'].value_counts().values

plt.figure(figsize=(10,5))
plt.pie(x=sizes,autopct='%1.1f%%',explode=(0.1,0),shadow=True, textprops={'color':"gray"}, 
startangle=90,colors=["teal","darkkhaki"],frame=True,pctdistance=1.2,labeldistance=0)
plt.axis('equal')
plt.legend(labels)
plt.title("Classes".upper(),fontsize=20)
plt.xticks([])
plt.yticks([])
plt.show()
print('Figure 1: Percentages of Mushroom Classes')

The two classes are nearly balanced: roughly half of the mushrooms are edible and half are poisonous.

In [None]:
plt.figure(figsize=(10,28))
for i,j in zip(df.iloc[:,1:].columns,range(1,23)):
    plt.subplot(11,2,j)
    sns.countplot(x=i, data=df, palette="twilight", edgecolor="black")
plt.tight_layout()
plt.show()

print('Figure 2: Count Plots for Attributes')

Which features are most indicative of a poisonous mushroom? To answer this question, the query charts break each attribute's value counts down by class.
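
The same question can also be approached numerically. The sketch below computes, for each value of one feature, the share of rows labelled poisonous; values with a share near 1 or 0 are the most informative. The helper name `poisonous_share` and the toy rows are assumptions made for illustration and are not taken from `mushrooms.csv`.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT taken from mushrooms.csv.
toy = pd.DataFrame({
    "class": ["Poisonous", "Poisonous", "Edible", "Edible", "Poisonous", "Edible"],
    "odor":  ["f", "f", "n", "a", "n", "n"],
})


def poisonous_share(frame, feature, target="class", positive="Poisonous"):
    """Share of rows labelled `positive` for each value of `feature`."""
    # normalize="index" turns each row of the table into proportions that sum to 1.
    table = pd.crosstab(frame[feature], frame[target], normalize="index")
    return table[positive].sort_values(ascending=False)


shares = poisonous_share(toy, "odor")
print(shares)
```

On the real data, calling `poisonous_share(df, column)` for each feature column would give a ranking to read alongside the charts.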

In [None]:
plt.figure(figsize=(15,40))
for i,j in zip(df.iloc[:,1:].columns,range(1,23)):
    plt.subplot(11,2,j)
    df.groupby(i)['class'].value_counts().plot(kind="barh",edgecolor="black",color="teal")
    plt.xlabel("Value Counts")
plt.tight_layout()
plt.show()
print('Figure 3: Query Charts for Attributes')

<a id="t3."></a>
# 3. Classification

<a id="t3.1"></a>
## 3.1 Split data for train and test

In [None]:
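X = df.drop('class', axis=1)
y = df['class']

# One-hot encode the features only; encoding df itself would leak the target into X.
X_encoded = pd.get_dummies(X, prefix_sep="_")
# LabelEncoder sorts the labels, so Edible -> 0 and Poisonous -> 1.
y_encoded = LabelEncoder().fit_transform(y)

X_scaled = StandardScaler().fit_transform(X_encoded)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y_encoded, test_size=0.30, random_state=38)
# Class names for the plots, listed in the encoder's order.
classes = ['Edible', 'Poisonous']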

<a id="t3.2"></a>
## 3.2 Functions for models

In [None]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=RuntimeWarning)

from yellowbrick.classifier import ROCAUC,ConfusionMatrix
from sklearn.metrics import accuracy_score

def Performance(model):
    global X_train, y_train, X_test, y_test, classes
    
    print("REPORT:")
    print(classification_report(y_test,model.predict(X_test)))
    
    visualizer = ROCAUC(model, classes=classes)
    visualizer.fit(X_train, y_train)      
    visualizer.score(X_test, y_test)        
    visualizer.show();

    plt.figure(figsize=(3,3))
    cm = ConfusionMatrix(model, classes=classes)
    cm.fit(X_train, y_train)
    cm.score(X_test, y_test)
    plt.xticks(rotation=0)
    cm.show();

In [None]:
from sklearn.model_selection import cross_val_score

def CrossValidationScore(model_list):
    """Return a styled table of the mean 5-fold cross-validation score per model."""
    model_name = [type(model).__name__ for model in model_list]
    mean_cross_val_score = [
        cross_val_score(model, X_scaled, y_encoded, cv=5).mean() for model in model_list
    ]

    cvs = pd.DataFrame({"Model Name": model_name, "CVS": mean_cross_val_score})
    return cvs.style.background_gradient("Greens")

<a id="t3.3"></a>
## 3.3 Models

In [None]:
mlp = MLPClassifier(hidden_layer_sizes=(64,128,64),activation="relu",max_iter=500,solver="adam")
mlp.fit(X_train,y_train)

Performance(mlp)

In [None]:
rfc = RandomForestClassifier(n_estimators=150)
rfc.fit(X_train,y_train)

Performance(rfc)

In [None]:
gbc = GradientBoostingClassifier()
gbc.fit(X_train,y_train)

Performance(gbc)

In [None]:
xrfc = XGBRFClassifier()
xrfc.fit(X_train,y_train)

Performance(xrfc)

In [None]:
xgbc = XGBClassifier()
xgbc.fit(X_train,y_train)

Performance(xgbc)

In [None]:
model_list = [mlp, gbc, rfc, xrfc, xgbc]
CrossValidationScore(model_list)

<a id="t4."></a>
# 4. Association Rules

Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of interestingness.

Based on the concept of strong rules, Rakesh Agrawal, Tomasz Imieliński and Arun Swami introduced association rules for discovering regularities between products in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets. For example, the rule {onion, potatoes} => {burger} found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, they are likely to also buy hamburger meat. Such information can be used as the basis for decisions about marketing activities such as promotional pricing or product placements.
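
Three common measures of interestingness for a rule A => B are support (how often A and B occur together), confidence (how often B occurs when A does), and lift (confidence divided by the support of B). As a minimal sketch, the code below computes them for the onion and potatoes rule on a handful of invented baskets; the transactions and the helper names `support` and `rule_metrics` are assumptions for illustration only.

```python
# Toy baskets, invented for illustration.
transactions = [
    {"onion", "potatoes", "burger"},
    {"onion", "potatoes", "burger", "milk"},
    {"onion", "potatoes"},
    {"burger", "milk"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def rule_metrics(antecedent, consequent, transactions):
    """Support, confidence and lift of the rule antecedent => consequent."""
    both = support(set(antecedent) | set(consequent), transactions)
    confidence = both / support(antecedent, transactions)
    lift = confidence / support(consequent, transactions)
    return {"support": both, "confidence": confidence, "lift": lift}


metrics = rule_metrics({"onion", "potatoes"}, {"burger"}, transactions)
print(metrics)
```

A lift above 1 means the antecedent and consequent appear together more often than they would if they were independent.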

In addition to the above example from market basket analysis, association rules are employed today in many application areas including Web usage mining, intrusion detection, continuous production, and bioinformatics. In contrast with sequence mining, association rule learning typically does not consider the order of items either within a transaction or across transactions.

<a id="t4.1"></a>
## 4.1 Apriori Algorithm

The Apriori algorithm was proposed by Agrawal and Srikant in 1994. Apriori is designed to operate on databases containing transactions (for example, collections of items bought by customers, or details of website visits or IP addresses). Other algorithms are designed for finding association rules in data having no transactions (Winepi and Minepi), or having no timestamps (DNA sequencing). Each transaction is seen as a set of items (an itemset). Given a threshold C, the Apriori algorithm identifies the itemsets which are subsets of at least C transactions in the database.

Apriori uses a "bottom up" approach, where frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data. The algorithm terminates when no further successful extensions are found.

Apriori uses breadth-first search and a hash tree structure to count candidate itemsets efficiently. It generates candidate itemsets of length k from itemsets of length k-1. Then it prunes the candidates which have an infrequent subpattern. According to the downward closure lemma, the candidate set contains all frequent k-length itemsets. After that, it scans the transaction database to determine frequent itemsets among the candidates.

<center><img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/549fa6a5f46897d137b5d704ef7f30b6ba36d4de"></center>
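
The generate, prune, and count loop described above can be sketched in plain Python. This is a simplified illustration, not the `mlxtend` implementation: it counts candidates with a dictionary instead of a hash tree, and the function name `apriori_sketch` and the toy transactions are assumptions. The `min_count` argument plays the role of the threshold C.

```python
from itertools import combinations


def apriori_sketch(transactions, min_count):
    """Return {itemset: count} for itemsets contained in at least min_count transactions."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # Scan the transactions and keep only the candidates that are frequent.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_count}

    items = {item for t in transactions for item in t}
    level = count({frozenset([item]) for item in items})
    frequent = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (downward closure): every (k-1)-subset must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        level = count(candidates)
        frequent.update(level)
        k += 1
    return frequent


toy = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
result = apriori_sketch(toy, min_count=3)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

On the toy data every single item and every pair is frequent, while the triple appears in only two transactions and is dropped, so the loop stops after the third level.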

In [None]:
df_ap = pd.get_dummies(df,prefix_sep="_")
df_ap.head()

In [None]:
df1 =  apriori(df_ap, min_support=0.80, use_colnames = True, verbose=1)
df1.style.background_gradient("Greens")

In [None]:
association_rules(df1, metric = "support", min_threshold = 0.9).style.background_gradient("gist_earth_r")