In [None]:
#Basic Data analysis and munging (if needed)
import pandas as pd
import numpy as np

#sklearn and all the things I used.
from sklearn.preprocessing import LabelEncoder 
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression

#Seaborn/matplotlib to visualize stuff.
import seaborn as sns
from matplotlib import pyplot as plt
#Changing these settings lets us view all of the columns, avoiding that "..." in the middle of the df prints.
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

#Data Set being used is the Mushroom Data set from UCI ML
#Found here https://www.kaggle.com/uciml/mushroom-classification

df = pd.read_csv("../input/mushroom-classification/mushrooms.csv")

df.head()

Quickly checking the data to see what I'm going to work with (it is all categorical).

In [None]:
df.info()


I want all of these non-null object types to be integers so the Random Forest and Logistic Regression models can actually work with the data.
So I used label encoding (LabelEncoder from sklearn.preprocessing).

In [None]:
le = LabelEncoder()

for column in df:
    df[column] = le.fit_transform(df[column])
    
df.info()
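
One side effect of reusing a single encoder in the loop above is that only the last column's mapping is kept. A minimal sketch of keeping one LabelEncoder per column so the integer codes can be translated back later. The frame `toy` and the dict `encoders` are made-up names for illustration, not part of the notebook:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the mushroom data (values are made up for illustration).
toy = pd.DataFrame({"class": ["p", "e", "e", "p"],
                    "bruises": ["t", "f", "t", "f"]})

encoders = {}              # hypothetical helper: column name -> fitted encoder
toy_encoded = toy.copy()
for column in toy.columns:
    encoders[column] = LabelEncoder()
    toy_encoded[column] = encoders[column].fit_transform(toy[column])

# classes_ is sorted, so a label's position in it is its integer code.
class_mapping = {str(label): code
                 for code, label in enumerate(encoders["class"].classes_)}
print(class_mapping)  # {'e': 0, 'p': 1}
```

With the encoders kept around, `encoders["class"].inverse_transform([0, 1])` gives back the original letters.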

In [None]:

df.head()


Looking at the data now, everything is numerical. The class is 0 or 1 (LabelEncoder sorts labels alphabetically, so 'e' for edible becomes 0 and 'p' for poisonous becomes 1), and bruises is likewise either 0 or 1. The higher the numbers we see in a column, the more possible answers there are to that question (for the random forest model).

In [None]:

df.describe()

The description is nice for seeing the max, which NOW for our purposes tells us how many unique values each feature can have: since the codes start at 0, the number of unique values is the max plus one.
Like I mentioned above, the "class" max is 1 (values 0 and 1), while cap-color has a max of 9, so it has 10 unique values for our Random Forest to work through.
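
Reading the counts off the max works only because the codes run from 0 upward without gaps. A small sketch of counting unique values directly with `nunique()`, using made-up color letters and pandas category codes as a stand-in for LabelEncoder (both assign codes to the sorted labels):

```python
import pandas as pd

# Made-up sample of single-letter color codes, not the real column.
colors = pd.Series(["n", "y", "w", "n", "g"])

# Category codes are assigned to the sorted labels: g=0, n=1, w=2, y=3.
codes = colors.astype("category").cat.codes

max_code = codes.max()           # highest integer code
unique_count = codes.nunique()   # number of distinct values
print(max_code, unique_count)    # 3 4
```

On the real frame, `df.nunique()` would give the same count for every column at once.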

Now I'll get the data ready to actually be worked with, creating an X dataset of features and the matching y dataset of labels, dropping duplicates beforehand. With a total of 8124 entries, I figured holding out 30% as the test set was ok(?)

In [None]:
# I create an X variable here purely because it's easier for me to recall what I was doing when I go back and look at this.
# If duplicate rows end up on both sides of the train/test split, the test accuracy
# will likely be unreasonably high, so remove any duplicates first (if there are any).
X = df.drop_duplicates(keep='first')

# The classification target (y) and the final feature set to split
y = X['class']
X = X.drop(['class'], axis=1)
# I used a fixed random_state so the split stays the same while testing the random forest parameters.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=50)

Here is the code I used to create a range of Random Forest models. The number of trees and the maximum depth of the trees change in each test: depth goes from 1 to 9 in steps of 1, and tree count goes from 5 to 100 in steps of 5.

The criterion remained "gini" since I had tested entropy on this dataset before and it performed worse.
I set a random_state here to get the same results while I was testing things out, saving the confusion matrix from each model's predictions.

In [None]:
# n_estimators is the number of trees in the forest
# max_depth is the maximum depth (number of split levels) each tree may grow to
# Once again using a random_state to keep the results reproducible; I could alter criterion as well.
# The random forest trials
forest_trials = []
tree_count = list(range(5, 101, 5))  # 5, 10, ..., 100
for n in tree_count:
    for depth in range(1,10):
        forest = RandomForestClassifier(n_estimators=n, max_depth=depth , random_state=50, criterion='gini')  
        forest.fit(X_train, y_train)
        forest_trials.append({'Depth':depth,'Trees':n, 'cfm':confusion_matrix(y_test, forest.predict(X_test))})
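
To see what max_depth actually limits, here is a minimal sketch on synthetic data (`make_classification` stands in for the mushroom data; `X_demo` and `y_demo` are made-up names). It fits a small forest and checks how deep each fitted tree ended up:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data purely for illustration.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

demo_forest = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0)
demo_forest.fit(X_demo, y_demo)

# Each fitted tree reports its own depth; none may exceed max_depth.
depths = [tree.get_depth() for tree in demo_forest.estimators_]
print(max(depths))
```

The cap applies to each tree separately, so adding trees does not make any single tree deeper.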

In [None]:
#Just Checking to see if I was appending the data in a way I can work with.
print(forest_trials[0])

Getting the Percentage of correct predictions, Tree count and Depth for each model tested, storing it in a dictionary that can be used by Seaborn

In [None]:
#Create the X/Y data to plot, computing accuracy as (True Positive + True Negative) / total predictions.

tp_fp = {'Accuracy':[],'Trees':[], 'Depth':[]}

for trial in forest_trials:
    true_negative = trial['cfm'][0][0]
    false_positive = trial['cfm'][0][1]
    false_negative = trial['cfm'][1][0]
    true_positive = trial['cfm'][1][1]
    summed = true_positive+true_negative+false_positive+false_negative
    tp_fp['Trees'].append(trial['Trees'])
    tp_fp['Depth'].append(trial['Depth']) 
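    # Scale to a percentage before rounding so values like 56.99999999999999 don't show up in the legend.
    tp_fp['Accuracy'].append(round((true_positive + true_negative) / summed * 100))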


tp_fp['Accuracy'] = tuple(tp_fp['Accuracy'])
tp_fp['Trees']= tuple(tp_fp['Trees'])
tp_fp['Depth'] = tuple(tp_fp['Depth'])
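
The accuracy calculation above can also be written more compactly. A minimal sketch, assuming a square confusion matrix where the diagonal holds the correct predictions (`accuracy_from_cfm` and the example numbers are made up for illustration):

```python
import numpy as np

def accuracy_from_cfm(cfm):
    """Fraction of correct predictions: diagonal sum over the total count."""
    cfm = np.asarray(cfm)
    return float(np.trace(cfm) / cfm.sum())

# Made-up 2x2 matrix: rows are true classes, columns are predicted classes.
example_cfm = [[50, 2],
               [3, 45]]
print(accuracy_from_cfm(example_cfm))  # 0.95
```

This gives the same number as (TP + TN) / total, and it also works unchanged for more than two classes.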


Initially I tried a basic lineplot of all of our models. This graph shows all of the tree counts as well as the depth levels. There is one glaring issue: the depth levels begin to overlap A LOT, which makes this really messy for me to read. The best I could say is that anything past a depth of 3 seems likely unnecessary and would be computationally inefficient.

In [None]:
sns.set(style="darkgrid")

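# legend='full' lists every depth; with 'brief', seaborn may show only a sample of the
# numeric hue levels, and hand-written labels would then be attached to the wrong lines.
ax = sns.lineplot(x='Trees', y="Accuracy", hue="Depth", legend='full', data=tp_fp)
ax.set(xlabel='Number of Trees', ylabel='Accuracy', title="Tree Count vs Depth")
ax.legend(ncol=3,fontsize= 'large', bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.,title="Tree Depth")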

plt.show()

The next graph I tried is a scatterplot. This one was easier for me to read as the overlapping was resolved: there is exactly one point for each tree count at each depth. My y axis does not indicate accuracy anymore; instead, accuracy is indicated by the hue of the plotted points (see legend).

In [None]:
sns.set(style="ticks")

ax = sns.scatterplot(x="Trees", y="Depth", hue='Accuracy', palette='muted', legend="brief", data=tp_fp)
ax.set(xlabel='Number of Trees', ylabel='Depth', title="Tree VS Depth Prediction Improvements")
ax.legend(ncol=3,fontsize= 'large',bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.,title="Correct Predictions (%)")

plt.show()


It's quite a bit easier for me to see here that at a depth of 4, any tree count past 30 is a waste and is likely just going to increase the time it takes to build the trees. Increasing the number of trees does affect accuracy by a decent amount, but it seems to have a less significant impact on reducing incorrect predictions. Higher tree count AND higher tree depth together definitely show a very high rate of successful predictions (often flawless).
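
Another way to read off where extra trees stop helping is to lay the same results out as a Depth x Trees table. A minimal sketch with made-up numbers (`toy_results` mimics the shape of the tp_fp dictionary; the accuracies are invented):

```python
import pandas as pd

# Invented results in the same shape as the tp_fp dictionary.
toy_results = {"Trees":    [5, 5, 10, 10],
               "Depth":    [1, 2, 1, 2],
               "Accuracy": [88.0, 93.0, 90.0, 96.0]}

# One row per depth, one column per tree count, accuracy in the cells.
table = pd.DataFrame(toy_results).pivot(index="Depth", columns="Trees", values="Accuracy")
print(table)
```

The table could then be passed to `sns.heatmap(table, annot=True)` as an alternative to the scatterplot.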

Moving on to Logistic Regression, I gathered all of the available solver types as well as the regularization penalties available in sklearn's LogisticRegression. I ran the tests with a max iteration count of 50k: fit the data, predicted on the test set, and made a scatterplot of the results similar to the Random Forests above.

In [None]:

finalVals = []
# Each solver only supports some penalties, so map solver -> penalties to try.
# Note: scikit-learn 1.2+ deprecates the string 'none' in favor of penalty=None.
solvers = {"newton-cg":["l2",'none'], "lbfgs":["l2", 'none'], "liblinear":["l2","l1"], "sag":["l2", 'none'], "saga":["l2","l1", 'none']}
for solver, penalties in solvers.items():
    for penalty in penalties:
        model = LogisticRegression(max_iter=50000, solver=solver, penalty=penalty)
        model.fit(X_train, y_train)
        prediction = model.predict(X_test)
        cfm = confusion_matrix(y_test, prediction)
        true_negative = cfm[0][0]
        false_positive = cfm[0][1]
        false_negative = cfm[1][0]
        true_positive = cfm[1][1]
        # Fraction of correct predictions, rounded to two decimals
        total_correct = round((true_negative + true_positive) / (true_negative + true_positive + false_positive + false_negative), 2)
        finalVals.append([solver, penalty, total_correct])
  


This graph shows some of what I already knew: having no regularization does increase the percentage of correct predictions, but that is likely not the most effective model and it may be overfitting. Adding any type of regularization penalty didn't seem to have a drastic effect on any of our models. (Some solvers are not compatible with some regularization penalties, so dots will be missing for those.)

In [None]:
graph = {"solver":[],"penalty":[],"accuracy":[]}
for each_pred in finalVals:
    graph["solver"].append(each_pred[0])
    graph["penalty"].append(each_pred[1])
    graph["accuracy"].append(each_pred[2])
    
ax = sns.scatterplot(x="solver", y="accuracy", hue='penalty', palette='muted', legend="brief", data=graph)
ax.set(xlabel='Solver', ylabel='Accuracy')
ax.legend(ncol=1,fontsize= 'large',bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0, title="Regularization Type (Penalty)")

This little learning "thing" I did has got me considering a larger scale approach. Having seen these results, I wonder if it would be more beneficial to not use a fixed random_state and instead create more than one of each model type. Averaging their results may give me a better representation than predicting the same one data set with the same one model, just once, and plotting it. The same could be done with the random forest testing above; I used a random state there to give me some control over the testing when I began working with the output data.
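
A rough sketch of what that repeated-trial idea could look like, on synthetic data rather than the mushroom set (`make_classification`, the five seeds and the forest settings are all placeholder choices, not tuned values):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, purely for illustration.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

scores = []
for seed in range(5):
    # A different seed gives a different split and a different forest each time.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.3, random_state=seed)
    trial_model = RandomForestClassifier(n_estimators=20, max_depth=3, random_state=seed)
    trial_model.fit(X_tr, y_tr)
    scores.append(trial_model.score(X_te, y_te))  # accuracy on the held-out part

mean_accuracy = np.mean(scores)
spread = np.std(scores)
print(mean_accuracy, spread)
```

Reporting the spread next to the mean shows how much of a difference between two settings is just down to the split. sklearn's cross_val_score is a related ready-made option, though it is a different technique (k-fold) from the repeated random splits sketched here.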

There is also an 'elasticnet' penalty (in sklearn only the saga solver supports it, and it needs an l1_ratio value), but I ran into a couple of errors with the iterative process, so I have left it out for the moment while I figure out how to correct that error.
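
For reference, a minimal sketch of an elastic net fit on synthetic data. It assumes the saga solver and an arbitrary l1_ratio of 0.5 (a placeholder, not a tuned value), and it is not necessarily the fix for the errors mentioned above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data, purely for illustration.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# l1_ratio mixes the two penalties: 0 is pure l2, 1 is pure l1.
enet_model = LogisticRegression(solver="saga", penalty="elasticnet",
                                l1_ratio=0.5, max_iter=5000)
enet_model.fit(X_demo, y_demo)

train_accuracy = enet_model.score(X_demo, y_demo)
print(train_accuracy)
```

In the solver dictionary this would be a special case, since l1_ratio has to be passed along only for this penalty.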