# Business Case

**Overview**

In this competition, you will help the workers at Rico Bisquito's cookie company (from Homework 0) determine whether a cookie is defective. Here, I define a defective cookie as one with a quality issue that makes it unsuitable for consumption or sale.

The dataset is derived from cookies I made for a birthday party; my friends assisted in measuring the sensory data.

**Files**

- cookie_train.csv - the training set

- cookie_test.csv - the test set

**Data Format**

Below are the features in the dataset and their measurement techniques:

- Texture Hardness: This feature is a measure of how hard or soft the cookie is. This is measured using a penetrometer, and the possible values are in Newtons (N).

- Texture Chewiness: This feature is a measure of how much force is required to chew the cookie. This is measured using a texture analyzer, and the possible values are in N.

- Texture Crispiness: This feature is a measure of how crispy or crunchy the cookie is. This is measured using an acoustic measurement system, and the possible values are in arbitrary units.

- Color L*: This feature is a measure of the lightness of the cookie. This is measured using a colorimeter, and the possible values are between 0 (black) and 100 (white).

- Color a*: This feature is a measure of the redness or greenness of the cookie. This is measured using a colorimeter, and the possible values are between -128 (green) and 127 (red).

- Color b*: This feature is a measure of the yellowness or blueness of the cookie. This is measured using a colorimeter, and the possible values are between -128 (blue) and 127 (yellow).

- Taste Sweetness: This feature is a measure of how sweet the cookie tastes. This is measured using a sensory evaluation method, and the possible values are on a scale from 0 (not sweet) to 10 (extremely sweet).

- Taste Saltiness: This feature is a measure of how salty the cookie tastes. This is measured using a sensory evaluation method, and the possible values are on a scale from 0 (not salty) to 10 (extremely salty).

- Taste Bitterness: This feature is a measure of how bitter the cookie tastes. This is measured using a sensory evaluation method, and the possible values are on a scale from 0 (not bitter) to 10 (extremely bitter).

- Shape Diameter: This feature is a measure of the diameter of the cookie. This is measured using a caliper, and the possible values are in millimeters (mm).

- Shape Thickness: This feature is a measure of the thickness of the cookie. This is measured using a caliper, and the possible values are in mm.

- Smell Intensity: This feature is a measure of how strong the cookie smells. This is measured using a sensory evaluation method, and the possible values are on a scale from 0 (no smell) to 10 (extremely strong smell).

- Smell Complexity: This feature is a measure of how complex the aroma of the cookie is. This is measured using a sensory evaluation method, and the possible values are on a scale from 0 (no complexity) to 10 (extremely complex aroma).

- Smell Specific Compound: This feature is a measure of the presence and intensity of a specific aroma compound in the cookie. This is measured using gas chromatography-mass spectrometry (GC-MS), and the possible values are in arbitrary units.

- Detected Chemical: This feature is a measure of what prevalent chemical was detected in the cookie, measured using chromatography. The possible values are just the chemical name.

- Defective: This is the target variable that indicates whether the cookie is defective or not. The possible values are 0 (not defective) and 1 (defective).
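
Several of the features above come with a documented range (for example, Color L* between 0 and 100 and the taste scores between 0 and 10). A minimal sketch of a range check follows. It assumes the CSV column names match the feature names listed above, which is not confirmed here; `EXPECTED_RANGES`, `out_of_range_counts`, and the `demo` frame are hypothetical names and made-up values used only for illustration.

```python
import pandas as pd

# Documented value ranges taken from the feature list above.
# Assumption: the column names match the feature names exactly.
EXPECTED_RANGES = {
    "Color L*": (0, 100),
    "Color a*": (-128, 127),
    "Color b*": (-128, 127),
    "Taste Sweetness": (0, 10),
    "Taste Saltiness": (0, 10),
    "Taste Bitterness": (0, 10),
    "Smell Intensity": (0, 10),
    "Smell Complexity": (0, 10),
}


def out_of_range_counts(df, ranges=EXPECTED_RANGES):
    """Count, per column, the non-null values outside the documented range."""
    counts = {}
    for col, (low, high) in ranges.items():
        if col in df.columns:  # skip columns that are not present
            values = df[col].dropna()
            counts[col] = int(((values < low) | (values > high)).sum())
    return pd.Series(counts, dtype="int64")


# Tiny made-up frame for illustration only (not competition data).
demo = pd.DataFrame({
    "Color L*": [55.0, 101.0, 70.0],
    "Taste Sweetness": [3.0, 11.0, -1.0],
})
result = out_of_range_counts(demo)
print(result)
```

On the real data, the same function could be called as `out_of_range_counts(df_train)` once the training set is loaded.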


# Import Libraries

In [None]:
#libraries for data manipulation
import numpy as np
import pandas as pd

#libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#to remove warning
import warnings
warnings.filterwarnings('ignore')

#to impute na
from sklearn.impute import SimpleImputer


#libraries for model building
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    BaggingClassifier,
    RandomForestClassifier,
    AdaBoostClassifier,
    GradientBoostingClassifier,
    StackingClassifier,
)
from xgboost import XGBClassifier

# To split data, cross-validate, and tune hyperparameters
from sklearn.model_selection import (
    train_test_split,
    StratifiedKFold,
    cross_val_score,
    GridSearchCV,
    RandomizedSearchCV,
)

# To get different metric scores
from sklearn.metrics import f1_score, accuracy_score, recall_score, precision_score

# To oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# To create and customize pipelines
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Import Dataset

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df_train = pd.read_csv('/kaggle/input/cs506-fall-2023-lab-defective-cookie-detection/cookie_train.csv')
df_test  = pd.read_csv('/kaggle/input/cs506-fall-2023-lab-defective-cookie-detection/cookie_test.csv')

# Data Overview and Sanity Check

In [None]:
# visualize the table
df_train.head()

In [None]:
# check the shape of the dataset
df_train.shape

In [None]:
# check data type
df_train.info()

In [None]:
# investigate detected chemicals
df_train['Detected Chemical'].value_counts()

In [None]:
# remove the Id column as we don't need it for modeling
df_train.drop('Id',axis=1,inplace=True)

In [None]:
# recheck the data type
df_train.info()

In [None]:
# check for nulls
df_train.isnull().sum()
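
If the null check reports missing values, the `SimpleImputer` imported earlier is one way to fill them. The sketch below is an assumption about how that could look, not a step this notebook performs: it uses median imputation on two made-up numeric columns, and the frames `train_part` and `val_part` are hypothetical. The imputer is fitted on the training part only and then reused on the validation part, so that no validation statistics leak into training.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up numeric data with gaps, for illustration only.
train_part = pd.DataFrame({
    "Texture Hardness": [1.0, np.nan, 3.0, 8.0],
    "Shape Thickness": [10.0, 12.0, np.nan, 14.0],
})
val_part = pd.DataFrame({
    "Texture Hardness": [np.nan, 2.0],
    "Shape Thickness": [11.0, np.nan],
})

# Learn the per-column medians from the training part only.
imputer = SimpleImputer(strategy="median")
train_filled = pd.DataFrame(
    imputer.fit_transform(train_part), columns=train_part.columns
)

# Reuse the training medians to fill the validation part.
val_filled = pd.DataFrame(imputer.transform(val_part), columns=val_part.columns)

print(imputer.statistics_)  # medians learned from train_part
print(val_filled)
```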

In [None]:
# check for duplicates
df_train.duplicated().sum()

In [None]:
# check statistical summary
df_train.describe().T

# Exploratory Data Analysis

In [None]:
def histogram_boxplot(data, feature, figsize=(15, 10), kde=True, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (15,10))
    kde: whether to show the density curve (default True)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle will indicate the mean value of the column
    # For histogram: pass bins only when a value was provided
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram

In [None]:
# function to create labeled barplots


def labeled_barplot(data, feature, perc=True, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is True)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n],
    )

    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot

In [None]:
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    # place the legend outside the plot area, to the right of the bars
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()

In [None]:
### function to plot distributions wrt target


def distribution_plot_wrt_target(data, predictor, target):

    fig, axs = plt.subplots(2, 2, figsize=(12, 10))

    target_uniq = data[target].unique()

    axs[0, 0].set_title(f"Distribution of {predictor} for target={target_uniq[0]}")
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )

    axs[0, 1].set_title(f"Distribution of {predictor} for target={target_uniq[1]}")
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )

    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )

    plt.tight_layout()
    plt.show()

## 1 Univariate Analysis

In [None]:
# isolate numerical values and visualize their histogram and boxplot
num_col = df_train.select_dtypes(include=np.number).columns.tolist()

# visualize them with a loop
for item in num_col:
    histogram_boxplot(df_train, item)

In [None]:
df_train.info()

In [None]:
# observation on Detected Chemical
labeled_barplot(df_train,'Detected Chemical')

## 2 Bivariate Analysis

In [None]:
# correlation matrix
plt.figure(figsize=(12, 7))
sns.heatmap(df_train[num_col].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
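
A heatmap with many features can be hard to scan, so it may help to list the most strongly correlated pairs as a table. The sketch below is one way to do that under stated assumptions: the columns `a`, `b`, and `c` are synthetic stand-ins rather than cookie features, and `pairs` is a hypothetical name. It keeps only the upper triangle of the correlation matrix so each pair appears once.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: "b" is built from "a", "c" is independent noise.
rng = np.random.default_rng(1)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

corr = demo.corr()

# Keep the strict upper triangle so each pair is listed once
# and the diagonal of self-correlations is dropped.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(upper)
    .stack()
    .dropna()
    .sort_values(key=abs, ascending=False)
)
print(pairs)
```

With the notebook's data, `demo` would be replaced by `df_train[num_col]`.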

In [None]:
# pairplots
sns.pairplot(data=df_train[num_col], diag_kind="kde")
plt.show()

## 3 Multivariate Analysis

In [None]:
sns.pairplot(data=df_train,hue='Defective')
plt.show()

# Data Preprocessing

In [None]:
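# replace values in Detected Chemical with integer codes
# note: '?-Myrcene' and 'Eugenol' each appear twice below; in a Python dict literal
# the last entry wins, so they are mapped to 38 and 42 respectively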

replaceStruct = {'Detected Chemical': {'?-Pinene': 1, 'Limonene': 2, '?-Caryophyllene': 3, '?-Myrcene': 4,
                 'Linalool':5, '?-Terpinene':6, 'Geraniol':7, '1-Octen-3-ol': 8, 'Ethyl butyrate': 9,
                 'Nerol': 10, 'Eugenol': 11, 'Eucalyptol': 12, 'Citral': 13, 'Citronellal': 14,
                 '2-Nonanone': 15, 'Camphor': 16, '?-Terpineol': 17,'?-Phellandrene': 18, '3-Carene': 19,
                 'Thymol': 20, 'Benzaldehyde': 21, 'Butyric acid': 22, 'Citronellol': 23, 'Furfural': 24,
                 'Geranyl acetate': 25, 'Linalyl acetate': 26, 'Menthol': 27, 'Methoxypyrazine': 28, 
                 '2-Heptanone': 29, 'Methyl salicylate': 30, 'Octenol': 31, 'gamma-Terpinene': 32,
                 'beta-Pinene': 33, 'beta-Myrcene': 34, 'alpha-Terpineol': 35, 'alpha-Cedrene': 36,
                 'Methyl anthranilate': 37, '?-Myrcene': 38, 'Hexanal': 39, 'Pulegone':40, 'Maltol': 41,
                 'Eugenol': 42, 'p-Cymene': 43, 'Isoamyl acetate': 44,'Anethole':45, 'Anisole': 46,'Terpineol': 47}}

                 
df_train= df_train.replace(replaceStruct)
df_test= df_test.replace(replaceStruct)
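
One caveat with a hand-written mapping is that `replace` leaves any label missing from the dictionary unchanged, so an unlisted chemical in the test set would stay a string. The sketch below shows an alternative, offered as an assumption rather than as this notebook's method: build the codes from the training labels and send anything unseen to a sentinel value. The helper `encode`, the sentinel `-1`, and the example labels (including "Vanillin") are hypothetical.

```python
import pandas as pd

# Made-up labels for illustration; "Vanillin" does not occur in training.
train_chem = pd.Series(["Limonene", "Linalool", "Limonene", "Camphor"])
test_chem = pd.Series(["Camphor", "Vanillin"])

# Assign codes 1..k in order of first appearance in the training labels.
codes = {name: i for i, name in enumerate(pd.unique(train_chem), start=1)}


def encode(series, mapping, unknown=-1):
    """Map labels to integer codes; labels not in the mapping become `unknown`."""
    return series.map(mapping).fillna(unknown).astype(int)


train_encoded = encode(train_chem, codes)
test_encoded = encode(test_chem, codes)

print(codes)
print(test_encoded.tolist())
```

As with the mapping above, these integer codes carry no real ordering; tree-based models tolerate that, while linear models generally would not.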

# Data Preparation for Model Building

In [None]:
# separate the features (X) from the target (y); the train/validation split in the
# next cell keeps the validation rows out of model fitting to avoid data leakage
X = df_train.drop(['Defective'], axis=1)
y = df_train['Defective']

In [None]:
# Splitting data into training, validation
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)

print(X_train.shape, X_val.shape)
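
The `stratify=y` argument keeps the share of defective cookies the same in both parts of the split. A small sketch on made-up labels illustrates the effect; the 20% defect rate and the names `X_demo`, `y_demo`, `train_rate`, and `val_rate` are assumptions chosen for the example, not properties of the competition data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 20 of them labeled defective (1).
X_demo = pd.DataFrame({"feature": np.arange(100)})
y_demo = pd.Series([1] * 20 + [0] * 80, name="Defective")

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

# With stratification, both parts keep the overall defect rate.
train_rate = y_tr.mean()
val_rate = y_va.mean()
print(train_rate, val_rate)
```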

In [None]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1

        },
        index=[0],
    )

    return df_perf
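
To make the four metrics concrete, the sketch below recomputes them by hand from confusion counts on a small made-up example. The function name `classification_metrics` and the label lists are hypothetical; returning 0.0 when a denominator is zero is a choice made for this sketch.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, recall, precision, and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    # Recall: share of truly defective cookies that were flagged.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Precision: share of flagged cookies that are truly defective.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"Accuracy": accuracy, "Recall": recall, "Precision": precision, "F1": f1}


# Made-up labels: 3 defective cookies, 5 good ones.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
results = classification_metrics(y_true, y_pred)
print(results)
```

Here two of the three defective cookies are caught and one good cookie is flagged by mistake, so recall and precision are both 2/3 while accuracy is 6/8.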

In [None]:
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.accuracy_score)
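
A scorer built this way is typically passed to cross-validation or a hyperparameter search through the `scoring` argument. The sketch below shows one possible use with `cross_val_score` and `StratifiedKFold`; it runs on synthetic data from `make_classification`, and `X_demo`, `y_demo`, and `scores` are hypothetical names.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, for illustration only.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=1)

# Same kind of scorer as defined above.
scorer = metrics.make_scorer(metrics.accuracy_score)

# Five stratified folds, shuffled with a fixed seed for repeatability.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(
    DecisionTreeClassifier(random_state=1), X_demo, y_demo, scoring=scorer, cv=cv
)
print(scores.mean(), scores.std())
```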

# Model Building

## 1 Decision Tree

In [None]:
model = DecisionTreeClassifier(random_state=1)
model.fit(X_train, y_train)

In [None]:
decision_tree_perf_train = model_performance_classification_sklearn(
    model, X_train, y_train
)
decision_tree_perf_train

In [None]:
decision_tree_perf_val = model_performance_classification_sklearn(
    model, X_val, y_val
)
decision_tree_perf_val
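
An unrestricted decision tree usually fits the training data almost perfectly, so a gap between the training and validation scores above would point to overfitting. The sketch below shows one possible remedy, a small grid search over tree size with `GridSearchCV`. It is an assumption about a next step, not part of the original workflow: the data is synthetic, and the parameter grid values are chosen arbitrarily for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the cookie features.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

# Baseline: a fully grown tree, as in the cell above.
full_tree = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
full_train_acc = full_tree.score(X_tr, y_tr)
full_val_acc = full_tree.score(X_va, y_va)

# Search over a few size limits (illustrative values only).
param_grid = {"max_depth": [2, 3, 4, 5, None], "min_samples_leaf": [1, 5, 10]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=1), param_grid, scoring="accuracy", cv=5
)
search.fit(X_tr, y_tr)
tuned_val_acc = search.score(X_va, y_va)

print("full tree  train/val:", full_train_acc, full_val_acc)
print("best params:", search.best_params_)
print("tuned tree val:", tuned_val_acc)
```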

# Submission

In [None]:
df_test.info()

In [None]:
# save id in test data
test_id = df_test['Id']

In [None]:
# drop the Defective column from the test data
df_test = df_test.drop('Defective',axis=1)

In [None]:
# drop id in test data
df_test = df_test.drop('Id',axis=1)

In [None]:
df_test['Detected Chemical'] = df_test['Detected Chemical'].astype('category')

In [None]:
df_train.head()

In [None]:
Z = df_test
Z.head()

In [None]:
submission_pred = model.predict(Z)
Z.head()

In [None]:
df_submit = pd.DataFrame({'Id':test_id.values,
                          'Category':submission_pred
                          })
df_submit.head()

In [None]:
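# write the submission file to the Kaggle working directory
import os
os.chdir(r'../working')
# index=False keeps the row index out of the file, leaving only Id and Category
df_submit.to_csv(r'submission.csv', index=False)
from IPython.display import FileLink
FileLink(r'submission.csv')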
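
Before uploading, it can be worth reading the file back to confirm its shape. The sketch below is a hedged round-trip check: it writes a made-up three-row frame to a temporary directory rather than touching the real submission, and it assumes the expected columns are `Id` and `Category` as used above. `demo_submit` and `check` are hypothetical names.

```python
import os
import tempfile

import pandas as pd

# Made-up predictions for illustration only.
demo_submit = pd.DataFrame({"Id": [101, 102, 103], "Category": [0, 1, 0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "submission.csv")
    # index=False prevents pandas from writing the row index as an extra column.
    demo_submit.to_csv(path, index=False)
    check = pd.read_csv(path)

print(check.columns.tolist())
print(check.shape)
```

If the index were written as well, the file read back would contain a third, unnamed column.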