# Basic analysis
> A basic outline of the penguins dataset analysis

In this notebook, we get acquainted with the penguins dataset: what it contains and what we can do with it.

In [None]:
#basic ds package imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

In [None]:
#rmv
#read csv
df = pd.read_csv('penguins.csv')
df.info()

We can see that there are at least a few nulls here, so we'll need to handle them at some point. Otherwise, let's look at the first few rows to get a sense of the data...

In [None]:
#rmv
#get basic info and preview
display(df.head())
df.shape

Let's see what happens if we drop every row that contains any NA...

In [None]:
#rmv
df.dropna(inplace=True)
df.shape

Dropping these rows still leaves enough data to work with, so let's move forward with the reduced set.
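
As a rough sketch of how one might quantify what `dropna` removes, the snippet below counts missing values per column and compares row counts before and after. The small `toy` frame is made up for illustration and stands in for the real `penguins.csv`; the `body_mass_g` column name and all values are assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the penguins data; the values are invented for illustration.
toy = pd.DataFrame({
    'species': ['Adelie', 'Gentoo', 'Chinstrap', 'Adelie'],
    'island': ['Torgersen', 'Biscoe', 'Dream', None],
    'body_mass_g': [3750.0, np.nan, 3800.0, 3450.0],
})

# Missing values per column
null_counts = toy.isna().sum()

# Rows that survive dropping any row containing a missing value
rows_before = len(toy)
rows_after = len(toy.dropna())
rows_lost = rows_before - rows_after

print(null_counts)
print(f'{rows_lost} of {rows_before} rows dropped')
```

Running the same three steps on `df` before calling `dropna(inplace=True)` would show whether the nulls are concentrated in one column or spread across many.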

# Basic visualizations

In [None]:
def plot_multiple(indf, include=None, exclude=None, no_cols=3, figsize=(9,4)):
    
    #get subset (note: no_cols is currently unused; all plots go in a single row)
    plot_df = indf.select_dtypes(include=include) if include is not None else indf.select_dtypes(exclude=exclude)
    plot_type = 'cat' if 'object' in plot_df.dtypes.values else 'num'
    
    #setup subplots; squeeze=False keeps axs iterable even with a single column
    no_vars = len(plot_df.columns)
    fig, axs = plt.subplots(1, no_vars, figsize=figsize, squeeze=False)
    
    #plot based on categorical vs numerical
    for ax, col in zip(axs.ravel(), plot_df.columns):
        if plot_type=='num':
            plot_df[col].plot(kind='hist', ax=ax)
            ax.set_xlabel(col)
        else:
            plot_df[col].value_counts().plot(kind='bar', ax=ax, ylabel='Frequency')
    
    #adjust spacing once, after all subplots are drawn
    fig.tight_layout()
    
    return fig

In [None]:
#rmv
plot_multiple(df, include='object')

In [None]:
#rmv
plot_multiple(df, exclude='object')

# Pairwise relationships

In [None]:
#rmv
sns.pairplot(data=df, hue='species');

In [None]:
#rmv
sns.pairplot(data=df, hue='island');

# Modeling
Here, we generate synthetic predictions (random probabilities with noisy labels) so we can inspect the confusion matrix and choose a classification threshold.

In [None]:
def generate_test_data(pred_sz=30):
    
    #randomly generate probabilities
    probs = np.random.rand(pred_sz)
    
    #label as 1 when prob >= 0.72, then flip roughly 10% of labels at random to add noise
    actual = probs>=0.72
    random_wrong = np.random.rand(pred_sz)<0.1
    actual[random_wrong] = ~actual[random_wrong]
    actual = actual.astype(int)
    
    #create preds df
    pred_df = pd.DataFrame({'.p0':1-probs, '.p1':probs, '.actual':actual})
    
    return pred_df

In [None]:
#rmv
#generate dataset
pred_df = generate_test_data()
pred_df
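
The comments in `generate_test_data` imply a class balance that is worth knowing before choosing a threshold. As a rough check, assuming the same 0.72 cutoff and 10% flip rate, the sketch below simulates a large sample and compares the observed share of positives with the value worked out by hand. The seed and sample size are arbitrary choices made here, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, for repeatability only
n = 100_000                      # large sample so the estimate is stable

probs = rng.random(n)
actual = probs >= 0.72           # about 28% positives before noise
flip = rng.random(n) < 0.1       # about 10% of labels get flipped
actual[flip] = ~actual[flip]

# Share of positives worked out by hand:
# positives that stay positive plus negatives that flip to positive
expected = 0.28 * 0.9 + 0.72 * 0.1
observed = float(actual.mean())

print(round(expected, 3), round(observed, 3))
```

Under these assumptions roughly a third of the labels are positive, so the classes are moderately imbalanced.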

In [None]:
def tune_threshold(preds_df, threshold=0.5):
    
    #classify as 1 when P(class 1) meets the threshold (adds '.pred_class' to preds_df in place)
    preds_df['.pred_class'] = (preds_df['.p1'] >= threshold).astype(int)
    
    #build the confusion matrix display; labels=[0, 1] keeps it 2x2 even if one class is absent
    cm = confusion_matrix(preds_df['.actual'], preds_df['.pred_class'], labels=[0, 1])
    disp = ConfusionMatrixDisplay(cm, display_labels=['male', 'female'])
    
    return disp

In [None]:
#rmv
test = tune_threshold(pred_df)
test.plot(cmap='RdPu');
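
Inspecting a single confusion matrix at 0.5 does not by itself pick a threshold. One possible way to compare candidates, sketched below, is to sweep a grid of thresholds and record the accuracy at each. The `sweep_thresholds` helper, the fixed seed, and the choice of accuracy as the criterion are all assumptions for illustration, not part of the original notebook. The synthetic data mimics the `.p1` and `.actual` columns above, but without label noise.

```python
import numpy as np
import pandas as pd

def sweep_thresholds(p1, actual, thresholds):
    """Return accuracy for each candidate threshold (hypothetical helper)."""
    rows = []
    for t in thresholds:
        pred = (p1 >= t).astype(int)
        rows.append({'threshold': t, 'accuracy': float((pred == actual).mean())})
    return pd.DataFrame(rows)

# Synthetic data shaped like the generated predictions:
# labels are 1 exactly when the probability is at least 0.72 (no noise here).
rng = np.random.default_rng(0)
p1 = rng.random(200)
actual = (p1 >= 0.72).astype(int)

results = sweep_thresholds(p1, actual, thresholds=[0.3, 0.5, 0.72, 0.9])

# Pick the row with the highest accuracy
best = results.loc[results['accuracy'].idxmax()]
print(results)
print('best threshold:', best['threshold'])
```

Because the toy labels use a 0.72 cutoff with no noise, the sweep recovers 0.72. With the 10% label flips used in `generate_test_data`, the best accuracy would typically fall short of perfect, and a criterion other than accuracy may suit imbalanced classes better.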

# Predict
Here, we predict the sex of a penguin. Note that the function below is a placeholder that draws a random value instead of calling a real model, so expect the output to change from call to call even when the inputs don't.

In [None]:
def penguins_predict(input_vec, thresh):
    
    #placeholder: input_vec is ignored and a random scalar stands in for the model score
    pred_val = np.random.rand()
    return 'female' if pred_val >= thresh else 'male'

In [None]:
#rmv
penguins_predict([3,4,5,56], 0.5)
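
If repeatable output is wanted while the real model is still missing, one option is to draw the placeholder score from a seeded generator. The sketch below is a variant, not the notebook's function: the `rng` parameter and the name `penguins_predict_seeded` are hypothetical additions.

```python
import numpy as np

def penguins_predict_seeded(input_vec, thresh, rng):
    """Placeholder prediction drawn from a caller-supplied generator (hypothetical variant)."""
    # input_vec is still ignored; a random draw in [0, 1) stands in for a model score
    pred_val = rng.random()
    return 'female' if pred_val >= thresh else 'male'

# Two generators built from the same seed yield the same sequence of draws
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)
first = [penguins_predict_seeded([3, 4, 5, 56], 0.5, rng_a) for _ in range(5)]
second = [penguins_predict_seeded([3, 4, 5, 56], 0.5, rng_b) for _ in range(5)]

print(first == second)
```

Passing the generator in, rather than seeding NumPy's global state, keeps the randomness under the caller's control.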

In [None]:
#rmv
#!jupyter nbconvert --RegexRemovePreprocessor.patterns="[.\s]*#rmv.*\s" basic_analysis.ipynb --to script --PythonExporter.exclude_markdown=True