## Problem Statement


### Business Context

Coffee roasting is the process of turning green coffee beans into brown ones. Beans can be roasted in a variety of ways, and the method influences the flavor of the end product. A roasting instrument is essentially a convection oven: it applies heat energy to the raw product, which makes the product consumable.
The price of coffee is heavily influenced by the quality of the beans after roasting, so the price can be set based on that quality.

The rising automation in the manufacturing business necessitates automating the quality inspection of output products with minimal human intervention. Quality inspectors examine each product after it is manufactured to ensure that it meets industry standards.

Each product's quality inspection is a time-consuming manual process, and a low-quality product wastes upstream factory capacity, consumables, labor, and money. With the emerging AI trend, companies are looking to leverage machine learning-based technologies to automate material quality inspection during the manufacturing process to reduce human intervention while achieving human-level or better accuracy.



### Objective

A roasting corporation named "KC Roasters" has engaged you to predict the quality of a roasting instrument's outputs, which will be used to determine the price of coffee beans.
The quality value ranges from 0 to 100, with 0 being the worst and 100 being the best; the higher the quality of the beans, the higher the price.

The coffee roasting instrument used by KC Roasters is divided into five equal-sized compartments (chambers), each with three temperature sensors. The three sensors are installed at three different locations so that they capture the temperature at different points inside the chamber.
Additionally, the height of the raw material (the volume entering the chamber) and the relative humidity of the roasted material are provided.

The data shared consists of 17 predictor variables and a continuous target variable, and the aim is to build a Regression model which can accurately predict the quality of the product. After finding out the quality, the company can decide the cost of beans effectively.


### Data Dictionary
- T_data_1_1 - Temperature recorded by 1st sensor in the 1st chamber in Fahrenheit
- T_data_1_2 - Temperature recorded by 2nd sensor in the 1st chamber in Fahrenheit
- T_data_1_3 - Temperature recorded by 3rd sensor in the 1st chamber in Fahrenheit
- T_data_2_1 - Temperature recorded by 1st sensor in the 2nd chamber in Fahrenheit
- T_data_2_2 - Temperature recorded by 2nd sensor in the 2nd chamber in Fahrenheit
- T_data_2_3 - Temperature recorded by 3rd sensor in the 2nd chamber in Fahrenheit
- T_data_3_1 - Temperature recorded by 1st sensor in the 3rd chamber in Fahrenheit
- T_data_3_2 - Temperature recorded by 2nd sensor in the 3rd chamber in Fahrenheit
- T_data_3_3 - Temperature recorded by 3rd sensor in the 3rd chamber in Fahrenheit
- T_data_4_1 - Temperature recorded by 1st sensor in the 4th chamber in Fahrenheit
- T_data_4_2 - Temperature recorded by 2nd sensor in the 4th chamber in Fahrenheit
- T_data_4_3 - Temperature recorded by 3rd sensor in the 4th chamber in Fahrenheit
- T_data_5_1 - Temperature recorded by 1st sensor in the 5th chamber in Fahrenheit
- T_data_5_2 - Temperature recorded by 2nd sensor in the 5th chamber in Fahrenheit
- T_data_5_3 - Temperature recorded by 3rd sensor in the 5th chamber in Fahrenheit
- H_data - Height of the raw material layer; represents the volume of raw material going inside the chamber, in pounds.
- AH_data - Relative humidity of the roasted coffee beans.
- quality - Quality of the beans


### **Please read the instructions carefully before starting the project.** 
This is a commented Jupyter (IPython) notebook in which all the instructions and tasks to be performed are mentioned.
* Blanks '_______' are provided in the notebook and need to be filled with appropriate code to get the correct result. Every '_______' blank comes with a comment that briefly describes what needs to be filled in.
* Identify the task to be performed correctly, and only then proceed to write the required code.
* Fill in the code wherever asked by commented lines like "# write your code here" or "# complete the code". Running incomplete code may throw an error.
* Please run the code cells sequentially from the beginning to avoid unnecessary errors.
* Add the results/observations (wherever mentioned) derived from the analysis to the presentation and submit it.


## Importing necessary libraries

In [None]:
!pip install pandas==1.5.3 numpy==1.25.2 matplotlib==3.7.1 seaborn==0.13.1 scikit-learn==1.2.2 xgboost==2.0.3 -q --user

Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.

In [None]:
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# To tune model, get different metric scores, and split data
from sklearn import metrics
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score

# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder

# To impute missing values
from sklearn.impute import SimpleImputer

# To do hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV

# To be used for creating pipelines and personalizing them
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)

# To suppress scientific notation for a dataframe
pd.set_option("display.float_format", lambda x: "%.3f" % x)

# To help with model building
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import (
    BaggingRegressor,
    RandomForestRegressor,
    GradientBoostingRegressor,
    AdaBoostRegressor,
    StackingRegressor,
)
from xgboost import XGBRegressor


# To suppress warnings
import warnings

warnings.filterwarnings("ignore")

## Loading the dataset

In [None]:
data = pd.read_csv('_______') ##  Complete the code to read the data

## Data Overview

The initial steps to get an overview of any dataset are to:
- observe the first few rows of the dataset, to check whether the dataset has been loaded properly or not
- get information about the number of rows and columns in the dataset
- find out the data types of the columns to ensure that data is stored in the preferred format and the value of each property is as expected.
- check the statistical summary of the dataset to get an overview of the numerical columns of the data
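
As a minimal sketch of these checks, the snippet below runs them on a small hypothetical DataFrame (`toy`, with made-up column names and values, not the project data). It uses only standard pandas calls; how you apply them to `data` in the cells that follow is up to you.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to illustrate the overview steps
toy = pd.DataFrame(
    {
        "temp": [250.0, 251.5, np.nan, 250.0],
        "height": [150.2, 149.8, 151.0, 150.2],
        "quality": [80, 82, 79, 80],
    }
)

print(toy.head())              # first few rows
print(toy.shape)               # (number of rows, number of columns)
toy.info()                     # data types and non-null counts
print(toy.describe().T)        # statistical summary of the numerical columns
print(toy.duplicated().sum())  # number of fully duplicated rows
print(toy.isnull().sum())      # missing values per column
```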

### Checking the shape of the dataset

In [None]:
# Checking the number of rows and columns in the data
data.'_______' ##  Complete the code to view dimensions of the data

### Displaying the first few rows of the dataset

In [None]:
# let's view the top 5 rows of the data
data.'_______' ##  Complete the code to view top 5 rows of the data  

In [None]:
# let's view the last 5 rows of the data
data.'_______' ##  Complete the code to view last 5 rows of the data  

### Checking for duplicate values

In [None]:
# let's check for duplicate values in the data
data.'_______' ##  Complete the code to check duplicate entries in the data

### Checking for missing values

In [None]:
# let's check for missing values in the data
data.'_______' ##  Complete the code to check missing entries in the data

### Checking the data types of the columns for the dataset

In [None]:
# let's check the data types of the columns in the dataset
data.info()

### Statistical summary of the dataset

In [None]:
# let's view the statistical summary of the numerical columns in the data
data.'_______' ##  Complete the code to print the statistical summary of the data

In [None]:
# creating a copy of the dataframe
df = data.copy()

## Exploratory Data Analysis

### Univariate analysis

In [None]:
# function to plot a boxplot and a histogram along the same scale.


def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle will indicate the mean value of the column
    # histogram; pass bins only when a value was provided
    if bins:
        sns.histplot(
            data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
        )
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram

In [None]:
# Observations on T_data_1_1
histogram_boxplot(df, "T_data_1_1", figsize=(12, 7), kde=False, bins=None)

In [None]:
histogram_boxplot(df, "________", figsize=(12, 7), kde=False, bins=None) #  Complete the code to view for T_data_1_2

In [None]:
histogram_boxplot(df, "________", figsize=(12, 7), kde=False, bins=None) #  Complete the code to view for T_data_1_3

In [None]:
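# The minimum value is 168 for the 2nd sensor and 183 for the 3rd sensor,
# so we replace values below 168 in the 1st sensor with 168.
# Assigning the result back avoids relying on an in-place update of a column selection.
df["T_data_1_1"] = df["T_data_1_1"].clip(lower=168)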
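
To make the effect of `clip` concrete, here is a small sketch on a hypothetical Series (the readings are made up): every value below the lower bound is raised to the bound, and all other values are left unchanged.

```python
import pandas as pd

# Hypothetical sensor readings; 168 is the lower bound used in the cell above
readings = pd.Series([13.0, 150.0, 168.0, 250.0, 300.0])
clipped = readings.clip(lower=168)
print(clipped.tolist())  # [168.0, 168.0, 168.0, 250.0, 300.0]
```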

In [None]:
histogram_boxplot(df, "________", figsize=(12, 7), kde=False, bins=None) #  Complete the code to view for T_data_2_1

In [None]:
#  Complete the code to view for T_data_2_2

In [None]:
#  Complete the code to view for T_data_2_3

In [None]:
# Check the minimum values of the 2nd and 3rd sensors, then replace lower values in the 1st sensor with that minimum
df["T_data_2_1"] = df["T_data_2_1"].clip(lower=107)

In [None]:
#  Write the code to view the distribution for each of the remaining sensors

In [None]:
histogram_boxplot(df, "________", figsize=(12, 7), kde=False, bins=None) #  Complete the code to view for quality

### Bivariate analysis

In [None]:
# Correlation matrix

sns.set(rc={"figure.figsize": (16, 10)})
sns.heatmap(
    df.corr(), annot=True, linewidths=0.5, center=0, cbar=False, cmap="Spectral"
)
plt.show()

In [None]:
sns.set(rc={"figure.figsize": (8, 4)})

# Quality vs AH_data
sns.scatterplot(data=df, x="quality", y="AH_data")

In [None]:
# Quality vs H_data
sns.scatterplot(data=df, x="quality", y="H_data")

In [None]:
# quality vs temp in 1st chamber

fig = plt.figure(figsize = (20,15))
fig.subplots_adjust(hspace=0.4, wspace=0.4)

ax = fig.add_subplot(2, 3, 1)
sns.scatterplot(data=df, x="quality", y="_______")  #  Complete the code to view for T_data_1_1

ax = fig.add_subplot(2, 3, 2)
sns.scatterplot(data=df, x="quality", y="______")  #  Complete the code to view for T_data_1_2

ax = fig.add_subplot(2, 3, 3) 
sns.scatterplot(data=df, x="quality", y="_____")   #  Complete the code to view for T_data_1_3

In [None]:
# Write the code for quality vs temp in each chamber

## Data Pre-Processing

In [None]:
# let's create a copy of the data
df1 = df.copy()

In [None]:
# Dividing data into X and y
X = df1.drop(["quality"], axis=1)
y = df1["quality"]

In [None]:
# Splitting data into training, validation and test sets:

X_train, X_temp, y_train, y_temp = train_test_split('_______') ## Complete the code to split the data into train and temporary sets in the ratio 60:40

X_test, X_val, y_test, y_val = train_test_split('_______') ## Complete the code to split the temporary set into test and validation sets in the ratio 50:50

print(X_train.shape, X_val.shape, X_test.shape)
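
The two-stage split can be sketched on synthetic data as follows. The array names (`X_toy`, `y_toy`) and the `random_state` value are assumptions made for illustration; the point is only that holding out 40% and then halving the held-out part yields a 60/20/20 split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 3 features (illustration only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = rng.uniform(0, 100, size=100)

# Step 1: keep 60% for training and hold out 40%
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_toy, y_toy, test_size=0.4, random_state=1
)
# Step 2: split the held-out 40% in half: 20% test, 20% validation
X_te, X_va, y_te, y_va = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=1
)

print(X_tr.shape, X_va.shape, X_te.shape)  # (60, 3) (20, 3) (20, 3)
```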

### Missing value imputation

In [None]:
# creating an instance of the imputer to be used
imputer = SimpleImputer(strategy="median")

In [None]:
# Fit and transform the train data
X_train = pd.DataFrame(imputer.fit_transform(X_train), columns=X_train.columns)

# Transform the validation data
X_val =  '_______' ## Complete the code to impute missing values in X_val

# Transform the test data
X_test = '_______' ## Complete the code to impute missing values in X_test

In [None]:
# Checking that no column has missing values in the train, validation or test sets
print(X_train.isna().sum())
print("-" * 30)

X_val.'_______' ## Complete the code to check the count of missing values in validation set
X_test.'_______' ## Complete the code to check the count of missing values in test set

## Model Building

**Let's create a function to calculate different metrics, so that we don't have to use the same code repeatedly for each model.**
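
For reference, the metrics computed in the next cell correspond to the formulas below, where $n$ is the number of rows, $k$ is the number of predictors, $y_i$ is the actual quality and $\hat{y}_i$ is the predicted quality. The symbols are introduced here for illustration; the code uses `n`, `k`, `target` and `pred`.

$$
R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}
$$

$$
\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}, \qquad \text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i \rvert
$$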

In [None]:
# function to compute adjusted R-squared
def adj_r2_score(predictors, targets, predictions):
    r2 = r2_score(targets, predictions)
    n = predictors.shape[0]
    k = predictors.shape[1]
    return 1 - ((1 - r2) * (n - 1) / (n - k - 1))


# function to compute different metrics to check performance of a regression model
def model_performance_regression(model, predictors, target):
    """
    Function to compute different metrics to check regression model performance

    model: regressor
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    r2 = r2_score(target, pred)  # to compute R-squared
    adjr2 = adj_r2_score(predictors, target, pred)  # to compute adjusted R-squared
    rmse = np.sqrt(mean_squared_error(target, pred))  # to compute RMSE
    mae = mean_absolute_error(target, pred)  # to compute MAE

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "RMSE": rmse,
            "MAE": mae,
            "R-squared": r2,
            "Adj. R-squared": adjr2,
        },
        index=[0],
    )

    return df_perf

**We are now done with pre-processing and evaluation criterion, so let's start building the model.**

### Decision Tree

In [None]:
dtree = DecisionTreeRegressor(random_state=1)
dtree.fit(X_train, y_train)

In [None]:
dtree_model_train_perf = model_performance_regression(dtree, X_train, y_train)
dtree_model_train_perf

In [None]:
dtree_model_val_perf = model_performance_regression(dtree, X_val, y_val)
dtree_model_val_perf

### Random Forest

In [None]:
rf_estimator = '_______'## Complete the code 
rf_estimator.fit'_______' ## Complete the code to fit the model on data

In [None]:
rf_estimator_model_train_perf = model_performance_regression(
    rf_estimator, X_train, y_train
)
rf_estimator_model_train_perf

In [None]:
rf_estimator_model_val_perf = '_______' ## Complete the code to check the performance on validation set
rf_estimator_model_val_perf

### Bagging Regressor

In [None]:
bag_estimator = '_______'## Complete the code 
bag_estimator.fit'_______' ## Complete the code to fit the model on data

In [None]:
bag_estimator_model_train_perf = model_performance_regression(
    bag_estimator, X_train, y_train
)
bag_estimator_model_train_perf

In [None]:
bag_estimator_model_val_perf = '_______' ## Complete the code to check the performance on validation set
bag_estimator_model_val_perf

### Adaboost

In [None]:
ab_regressor = '_______'## Complete the code 
ab_regressor.fit'_______' ## Complete the code to fit the model on data

In [None]:
ab_regressor_model_train_perf = model_performance_regression(
    ab_regressor, X_train, y_train
)
ab_regressor_model_train_perf

In [None]:
ab_regressor_model_val_perf = '_______' ## Complete the code to check the performance on validation set
ab_regressor_model_val_perf

### Gradient Boosting

In [None]:
gb_estimator = '_______'## Complete the code 
gb_estimator.fit'_______' ## Complete the code to fit the model on data

In [None]:
gb_estimator_model_train_perf = model_performance_regression(
    gb_estimator, X_train, y_train
)
gb_estimator_model_train_perf

In [None]:
gb_estimator_model_val_perf = '_______' ## Complete the code to check the performance on validation set
gb_estimator_model_val_perf

### Xgboost

In [None]:
xgb_estimator = '_______'## Complete the code 
xgb_estimator.fit'_______' ## Complete the code to fit the model on data

In [None]:
xgb_estimator_model_train_perf = model_performance_regression(
    xgb_estimator, X_train, y_train
)
xgb_estimator_model_train_perf

In [None]:
xgb_estimator_model_val_perf = '_______' ## Complete the code to check the performance on validation set
xgb_estimator_model_val_perf

## Model performance comparison

In [None]:
# training performance comparison

models_train_comp_df = pd.concat(
    [
        dtree_model_train_perf.T,
        rf_estimator_model_train_perf.T,
        bag_estimator_model_train_perf.T,
        ab_regressor_model_train_perf.T,
        gb_estimator_model_train_perf.T,
        xgb_estimator_model_train_perf.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision tree",
    "Random forest",
    "Bagging Regressor",
    "Adaboost",
    "Gradient Boosting",
    "Xgboost",
]
print("Training performance comparison:")
models_train_comp_df.T

In [None]:
# validation performance comparison

'_______' ## Write the code to compare the performance on validation set

**After looking at performance of all the models, let's decide which models can further improve with hyperparameter tuning.**

**Note**: You can choose to tune some other model if XGBoost gives error.

## Hyperparameter Tuning

### Tuning RandomForest Regressor model

In [None]:
%%time 

rf_tuned = RandomForestRegressor(random_state=1)

# Grid of parameters to choose from
parameters = {  
                'max_depth':[4, 6, 8, 10, None],
                'max_features': ['sqrt','log2',None],
                'n_estimators': [80, 90, 100, 110, 120]
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.r2_score)

# Run the randomized search
randomized_cv = RandomizedSearchCV(rf_tuned, parameters, scoring=scorer, n_iter=40, n_jobs = -1, cv=5, random_state=1)
randomized_cv = randomized_cv.fit(X_train, y_train)

# Set the regressor to the best combination of parameters
rf_tuned = randomized_cv.best_estimator_

# Fit the best algorithm to the data. 
rf_tuned.fit'_______' ## Complete the code to fit the model on data

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
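
For context on `n_iter`: a randomized search samples a fixed number of combinations from the grid instead of trying all of them. The sketch below counts the combinations in the random forest grid above using scikit-learn's `ParameterGrid`; the variable name `rf_grid` is chosen here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the random forest grid above
rf_grid = {
    "max_depth": [4, 6, 8, 10, None],
    "max_features": ["sqrt", "log2", None],
    "n_estimators": [80, 90, 100, 110, 120],
}

n_combinations = len(ParameterGrid(rf_grid))
print(n_combinations)  # 75
```

With `n_iter=40`, the search therefore evaluates 40 of the 75 possible combinations, each with 5-fold cross-validation.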

In [None]:
# Creating a new model with the best parameters
rf_tuned = RandomForestRegressor(
    random_state=1, max_depth=None, max_features="log2", n_estimators=110
)

rf_tuned.fit'_______' ## Complete the code to fit the model on data

In [None]:
rf_tuned_train_perf = '_______' ## Complete the code to check the performance on train set
rf_tuned_train_perf

In [None]:
rf_tuned_val_perf = '_______' ## Complete the code to check the performance validation set
rf_tuned_val_perf

### Tuning Bagging Regressor model

In [None]:
%%time 

# defining model
Model = BaggingRegressor(random_state=1)

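# Parameter grid to pass in RandomizedSearchCV
# (floats are fractions of samples/features; an integer 1 would mean a single sample/feature)
param_grid = {
              'max_samples': [0.7,0.8,0.9,1.0], 
              'max_features': [0.7,0.8,0.9,1.0],
              'n_estimators' : [50, 100, 120, 150],
             }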

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.r2_score)

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=20, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit'_______' ## Complete the code to fit the model on data

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))

In [None]:
# Creating a new model with the best parameters
bag_tuned = BaggingRegressor(
    random_state=1, max_samples=0.7, max_features=0.9, n_estimators=120
)

bag_tuned.fit(X_train, y_train)

In [None]:
bag_tuned_train_perf = model_performance_regression(bag_tuned, X_train, y_train)
bag_tuned_train_perf

In [None]:
bag_tuned_val_perf = '_______' ## Complete the code to check the performance on validation set
bag_tuned_val_perf

### Tuning DecisionTree Regressor model

In [None]:
%%time 

# Choose the type of regressor.
dtree_tuned = DecisionTreeRegressor(random_state=1)

# Grid of parameters to choose from
# (min_samples_leaf must be an int >= 1 or a float fraction; None is not a valid value)
parameters = {'max_depth': list(np.arange(15,20)) + [None], 
              'min_samples_leaf': [1, 3],
              'max_leaf_nodes' : [5, 10, 15] + [None],
              'min_impurity_decrease': [0.001, 0.0]
             }

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.r2_score)

# Run the randomized search (random_state added so the sampled combinations are reproducible)
randomized_cv = RandomizedSearchCV(dtree_tuned, parameters, scoring=scorer,cv=5, n_jobs = -1, verbose = 2, n_iter = 100, random_state=1)
randomized_cv = randomized_cv.fit'_______' ## Complete the code to fit the model on data

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))

In [None]:
dtree_tuned = DecisionTreeRegressor(
    random_state=1,
    max_depth=None,
    min_samples_leaf=1,
    max_leaf_nodes=None,
    min_impurity_decrease=0.001,
)

dtree_tuned.fit(X_train, y_train)

In [None]:
dtree_tuned_train_perf = model_performance_regression(dtree_tuned, X_train, y_train)
dtree_tuned_train_perf

In [None]:
dtree_tuned_val_perf = '_______' ## Complete the code to check the performance on validation set
dtree_tuned_val_perf

**We have now tuned all the models, let's compare the performance of all tuned models and see which one is the best.**

## Model performance comparison and choosing the final model

In [None]:
# training performance comparison

models_train_comp_df = pd.concat(
    [
        dtree_tuned_train_perf.T,
        bag_tuned_train_perf.T,
        rf_tuned_train_perf.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Tuned Decision Tree",
    "Tuned Bagging regressor",
    "Tuned Random forest",
]
print("Training performance comparison:")
models_train_comp_df.T

In [None]:
# validation performance comparison

'_______' ## Write the code to compare the performance on validation set

**Now we have our final model, so let's find out how our model is performing on unseen test data.**

In [None]:
# Let's check the performance on test set
'_______' ## Write the code to check the performance of best model on test data

### Feature Importances

In [None]:
feature_names = X_train.columns
importances = '_______' ## Complete the code to check the feature importance of the best model
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

## Let's use Pipelines to build the final model
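
As a hedged sketch of what a pipeline looks like, the snippet below chains the median imputer used earlier with a regressor and fits both in one call. The data is synthetic, the step names (`"imputer"`, `"model"`) are arbitrary labels, and `DecisionTreeRegressor` is only a placeholder for whichever model you selected as the best one.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with a few missing values (illustration only)
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(50, 4))
y_toy = X_toy[:, 0] * 10 + 50
X_toy[::10, 1] = np.nan  # inject missing values into one column

# Each step is a (name, transformer-or-estimator) pair; the last step is the model
toy_pipe = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("model", DecisionTreeRegressor(random_state=1)),
    ]
)

toy_pipe.fit(X_toy, y_toy)         # fits the imputer, then the model
preds = toy_pipe.predict(X_toy)    # applies the fitted imputer before predicting
print(preds.shape)  # (50,)
```

Because the imputer is fitted inside the pipeline, the raw data with missing values can be passed directly to `fit` and `predict`.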

In [None]:
Model = Pipeline('_______' ) ## Complete the code to create pipeline for the best model

In [None]:
# Separating target variable and other variables
X = df.drop(columns="quality")
Y = df["quality"]

In [None]:
# Splitting data into training and test set:

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1)

print(X_train.shape, X_test.shape)

In [None]:
Model.'_______' ##  Complete the code to fit the Model obtained from above step

In [None]:
# Let's check the performance on test set
Pipeline_model_test = model_performance_regression(__________, X_test, y_test) ##  Complete the code to check the performance on test set

## Business Insights and Conclusions

- 


***