<a href="https://colab.research.google.com/github/siyabansal9/AIML-2/blob/main/Sample_ML_Submission_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name** - Predicting Apple Quality using Machine Learning



##### **Project Type**    - Classification
##### **Contribution**    - Team
##### **Team Member 1 - Siya Bansal (2210992386)**
##### **Team Member 2 - Shruti Sharma (2210992358)**
##### **Team Member 3 - Sudhanshi (2210992413)**
##### **Team Member 4 - Sneha Jindal (2210992390)**

# **Project Summary -**

Project Summary: Predicting Apple Quality with Machine Learning

This project aims to develop a machine learning model capable of accurately predicting the overall quality of an apple. The project will utilize a dataset containing various characteristics of apples, such as size, weight, sweetness, and acidity. By analyzing these features, the model will learn to identify patterns and relationships that correlate with apple quality.

Project Objectives:

*   **Develop a model:** We will explore two approaches, classification and regression, to predict apple quality. A classification model assigns each apple to a pre-defined quality grade (e.g., excellent, good, bad) based on its features, whereas a regression model predicts a numerical quality score.
*   **Data Exploration and Preprocessing:** We will analyze the apple quality dataset thoroughly, identifying missing values, outliers, and correlations among features. Data cleaning and feature engineering will make the data suitable for model training.
*   **Model Selection and Training:** We will compare machine learning algorithms such as decision trees, support vector machines, and neural networks to find the best fit for predicting apple quality. The chosen model will be trained on the prepared dataset, with its hyperparameters tuned for the best performance.
*   **Model Evaluation:** Evaluation shows whether the model is reliable and generalizes. We will use accuracy for classification models and mean squared error for regression models, and cross-validation to assess performance on unseen data.


Deliverables:
*   A well-trained and documented machine learning model capable of predicting apple quality with a high degree of accuracy.
*   A detailed report outlining the project methodology, data analysis, model selection, training process, and evaluation results.
*   Code demonstrating the implementation of the chosen model.


Expected Outcomes:
The successful completion of this project will result in a valuable tool for apple farmers, retailers, and consumers. Farmers can utilize the model for automated quality control and optimized pricing strategies based on predicted quality. Retailers can leverage the model to ensure consistent fruit quality for their customers. Ultimately, consumers will benefit from a more consistent and enjoyable apple-buying experience.


Challenges and Anticipated Solutions:
*   Defining a robust quality metric is crucial. We will explore industry standards and consumer preferences to establish a reliable measure of apple quality.
*   Accounting for natural variations in fruit is essential. We will use data pre-processing techniques and model selection strategies to account for these variations and prevent overfitting.
*   Ensuring that the model generalizes requires careful validation, such as cross-validation, to confirm that it performs well on unseen data.

This project presents a significant opportunity to improve the efficiency and accuracy of apple quality prediction within the industry. By harnessing the power of machine learning, we can create a tool that benefits both producers and consumers, ultimately leading to a more sustainable and enjoyable apple experience.

# **GitHub Link -**

https://github.com/siyabansal9/AIML-2

# **Problem Statement**


**Predicting Apple Quality**

The apple quality dataset provides information on various characteristics of apples, such as size, weight, sweetness, and acidity. The goal is to leverage this data to develop a model that can accurately predict the overall quality of an apple.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following questions are answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure that the following points are answered for each algorithm.


*   Explain the ML model used and its performance using an Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with an updated Evaluation metric Score Chart.

*   Explain what each evaluation metric indicates for the business and the business impact of the ML model used.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import matplotlib.pyplot as plt, pandas as pd, numpy as np, seaborn as sns
from sklearn.metrics import confusion_matrix,classification_report
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.svm import SVC
import statsmodels.api as sm
from sklearn.preprocessing import RobustScaler


### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv("apple_quality.csv")

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
df.isnull().sum()

In [None]:
# Dataset Info
# Drop rows that contain missing values, then summarise dtypes and non-null counts
df = df.dropna()
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_count = df.duplicated().sum()
duplicate_count

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
null_values = df.isnull().sum().reset_index()
null_values

In [None]:
# Convert Acidity to a numeric dtype; values that cannot be parsed become NaN
df['Acidity'] = pd.to_numeric(df['Acidity'], errors='coerce')

In [None]:
# Visualizing the missing values
sns.heatmap(df.isnull(),yticklabels=False,cbar=False)

### What did you know about your dataset?

The dataset describes a set of apples. Each row holds a fruit ID (`A_id`) together with the fruit's size, weight, sweetness, crunchiness, juiciness, ripeness, and acidity, plus a quality label.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print(df.columns)

In [None]:
# Dataset Describe
df.describe()

### Variables Description

A_id : Unique identifier for each fruit.

Size : Size of the fruit.

Weight : Weight of the fruit.

Sweetness : Sweetness of the fruit.

Crunchiness : Texture indicating the crunchiness of the fruit.

Juiciness : Level of juiciness of the fruit.

Ripeness : Stage of ripeness of the fruit.

Acidity : Acidity level of the fruit.

Quality : Overall quality of the fruit.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
unique_values = df.nunique().reset_index()
unique_values

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code: number of apples in each Quality class
quality_counts = df['Quality'].value_counts()
plt.figure(figsize=(6, 6))
bar = plt.bar(x=quality_counts.index, height=quality_counts.values)
plt.bar_label(bar)
plt.show()

##### 1. Why did you pick the specific chart?

I chose a bar chart because it shows the distribution of quality levels in the dataset clearly. Each bar represents a quality category, and its height shows how many apples fall into that category, so the categories are easy to compare at a glance. The numerical label on top of each bar gives the exact count.
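
The counts behind such a bar chart can also be checked numerically. The sketch below is only an illustration: it assumes a small hypothetical `Quality` series rather than the real dataset, and shows both the absolute counts (the bar heights) and the relative shares (a quick view of class balance).

```python
import pandas as pd

# Hypothetical labels standing in for df['Quality'] (assumption, not real data)
quality = pd.Series(["good", "bad", "good", "good", "bad", "good"], name="Quality")

# Absolute frequency of each class: these are the bar heights
counts = quality.value_counts()

# Relative frequency of each class: a quick check of class balance
shares = quality.value_counts(normalize=True)

print(counts)
print(shares.round(2))
```

If one class dominates the shares, accuracy alone can be misleading, which is one reason to report precision and recall as well.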

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 2

In [None]:
y=df['Quality']
yid=df['A_id']
x=df.drop(['A_id','Quality'],axis=1)

In [None]:
# Chart - 2 visualization code
x.boxplot()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 3

In [None]:
# Chart - 3 visualization code
for i in x.columns:
    plt.hist(x[i], bins=100, color='pink')
    plt.xlabel(i)
    plt.ylabel('Frequency')
    plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4

In [None]:
# Chart - 4 visualization code
quality_counts = df['Quality'].value_counts()
plt.figure(figsize=(6, 6))
# Take the labels from the counts index so each slice is named correctly
plt.pie(quality_counts, labels=quality_counts.index, autopct='%1.1f%%', colors=['red', 'blue'])
plt.legend()
plt.title('Quality')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Correlate the feature columns only (exclude the identifier and the target)
cols = df.columns.drop(['A_id', 'Quality'])
correlation_matrix = df[cols].corr()
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))
plt.figure(figsize=(6, 4))
sns.heatmap(correlation_matrix, mask=mask, annot=True, cmap='coolwarm', linewidths=1,
            linecolor='black')
plt.title('Correlation Matrix')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code
sns.boxplot(x = "Quality", y = "Crunchiness", hue = "Quality", data = df)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

In [None]:
# Chart - 7 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
x.isnull().sum()

#### What all missing value imputation techniques have you used and why did you use those techniques?

No imputation was applied. Rows containing missing values were dropped earlier with `df.dropna()`, and `x.isnull().sum()` in the cell above only confirms how many missing values remain in each feature column. Checking with `.isnull()` before training matters because missing values can bias a model or cause errors during fitting.
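
For comparison, the two common options (dropping incomplete rows and imputing them) can be sketched as below. This uses a tiny hypothetical frame and median imputation via scikit-learn's `SimpleImputer`; the notebook itself only drops rows, so the imputation half is purely illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical frame with one missing value per column (assumption)
demo = pd.DataFrame({
    "Size": [1.0, np.nan, 3.0, 5.0],
    "Weight": [2.0, 4.0, np.nan, 8.0],
})

# Option 1: drop every row that contains a missing value
dropped = demo.dropna()

# Option 2: replace each missing value with the median of its column
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(demo), columns=demo.columns)

print(dropped.shape)            # the rows at index 1 and 2 are removed
print(imputed.isnull().sum())   # no missing values remain
```

Dropping is simplest when only a few rows are affected; imputation keeps more rows at the cost of inserting estimated values.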

### 3. Categorical Encoding

In [None]:
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

In [None]:
print(df['Quality'].value_counts())

In [None]:
le = LabelEncoder()
df['Quality'] = le.fit_transform(df['Quality'])

In [None]:
df['Quality'].unique()
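
`LabelEncoder` assigns integer codes in the sorted order of the class names, so it helps to print the mapping before interpreting precision or recall. A minimal sketch, assuming the two labels are the strings `good` and `bad`:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical label values (assumed to match the Quality column)
labels = ["good", "bad", "good", "bad", "good"]

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels)

# classes_ is sorted, so code 0 is "bad" and code 1 is "good"
mapping = {str(name): int(code) for code, name in enumerate(le_demo.classes_)}
print(mapping)

# inverse_transform recovers the original strings from the integer codes
restored = le_demo.inverse_transform(encoded)
print(list(restored))
```

Under this assumption the positive class (code 1) is `good`, so the default precision and recall scores describe how well good apples are identified.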

In [None]:
df.head()

In [None]:
df.shape

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
from sklearn.model_selection import train_test_split

# Features: every column except the identifier and the target
X = df.drop(['A_id', 'Quality'], axis=1)
# Target: the label-encoded Quality column
y = df['Quality']

# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Verify the shapes of the splits
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

##### What data splitting ratio have you used and why?

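80% of the data is used for training and the remaining 20% is held out for testing (`test_size=0.2`). A fixed `random_state=42` makes the split reproducible.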
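
One refinement worth considering is a stratified split, which keeps the class proportions the same in the training and test parts. The sketch below runs on synthetic data with an assumed 70/30 class ratio; the `stratify` argument is not used in the notebook's own split, so this illustrates a possible improvement rather than what was done.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and an imbalanced binary label (assumption: 70% / 30%)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 7))
y_demo = np.array([0] * 700 + [1] * 300)

# 80/20 split; stratify=y_demo preserves the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print("Train positive share:", y_tr.mean())
print("Test positive share:", y_te.mean())
```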

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
from sklearn.linear_model import LogisticRegression
algo = LogisticRegression()
algo.fit(X_train , y_train)

In [None]:
# Accuracy of the logistic regression model on the training and test sets
training_score = algo.score(X_train, y_train)
print("accuracy on training data : ", training_score)
testing_score = algo.score(X_test, y_test)
print("accuracy on testing data : ", testing_score)

In [None]:
# Actual labels and the model's predictions on the test set
y_true = y_test
y_pred = algo.predict(X_test)

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, log_loss, confusion_matrix

# Assuming y_pred and y_true are your predicted and actual labels respectively

# Accuracy
accuracy = accuracy_score(y_true, y_pred)
print("Accuracy:", accuracy)

# Precision
precision = precision_score(y_true, y_pred)
print("Precision:", precision)

# Recall
recall = recall_score(y_true, y_pred)
print("Recall:", recall)

# F1 Score
f1 = f1_score(y_true, y_pred)
print("F1 Score:", f1)

# Confusion Matrix
conf_matrix = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(conf_matrix)
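
The metrics above come from a single train/test split. One way cross-validation and hyperparameter tuning could be added is sketched below, using `RandomizedSearchCV` (already imported at the top of the notebook) on synthetic stand-in data. The search range for `C`, the number of sampled settings, and the five folds are illustrative assumptions, not tuned choices.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data (assumption); swap in the real train/test split
X_demo, y_demo = make_classification(
    n_samples=500, n_features=7, n_informative=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Sample the regularisation strength C on a log scale
param_distributions = {"C": loguniform(1e-3, 1e2)}

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions=param_distributions,
    n_iter=10,            # number of sampled settings
    cv=5,                 # 5-fold cross-validation on the training set
    scoring="accuracy",
    random_state=0,
)
search.fit(X_tr, y_tr)

print("Best C:", search.best_params_["C"])
print("Mean CV accuracy:", round(search.best_score_, 3))
print("Held-out accuracy:", round(search.score(X_te, y_te), 3))
```

Comparing the mean cross-validated accuracy with the held-out accuracy gives a rough indication of whether the tuned model generalizes.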

### ML Model - 2

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
classifier=DecisionTreeClassifier(criterion='gini' , random_state=42)
classifier.fit(X_train , y_train)
classifier.score(X_test , y_test)

In [None]:
classifier_entropy=DecisionTreeClassifier(criterion='entropy' , random_state=42)
classifier_entropy.fit(X_train , y_train)
classifier_entropy.score(X_test , y_test)

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X_train)
X_train_sc= sc.transform(X_train)
X_test_sc=sc.transform(X_test)

In [None]:
classifier_sc=DecisionTreeClassifier(criterion='gini' , random_state=42)
classifier_sc.fit(X_train_sc , y_train)
classifier_sc.score(X_test_sc , y_test)

In [None]:
classifier_sc=DecisionTreeClassifier(criterion='entropy' , random_state=42)
classifier_sc.fit(X_train_sc , y_train)
classifier_sc.score(X_test_sc , y_test)

### ML Model - 3

In [None]:
# ML Model - 3 Implementation
from sklearn.ensemble import RandomForestClassifier
# Create the random forest; it is fitted in the next cell
algo = RandomForestClassifier()

In [None]:
# Fit on the unscaled features and report train/test accuracy
algo.fit(X_train, y_train)
train_data = algo.score(X_train, y_train)
print("Train_data_accuracy = ", train_data)
test_data = algo.score(X_test, y_test)
print("Test_data_accuracy = ", test_data)

In [None]:
from sklearn.preprocessing import StandardScaler
# Fit the scaler on the training data only, then apply it to both splits
sc = StandardScaler()
sc.fit(X_train)
X_train_sc = sc.transform(X_train)
X_test_sc = sc.transform(X_test)

In [None]:
# Refit on the standardised features and report train/test accuracy
algo.fit(X_train_sc, y_train)
train_data = algo.score(X_train_sc, y_train)
print("Train_data_accuracy = ", train_data)
test_data = algo.score(X_test_sc, y_test)
print("Test_data_accuracy = ", test_data)

In [None]:
from sklearn.svm import SVC

# Support vector classifier with a linear kernel
linear_svc = SVC(kernel="linear")
linear_svc.fit(X_train, y_train)
y_pred = linear_svc.predict(X_test)

linear_svc.score(X_test, y_test)
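
Feature scaling tends to matter more for an SVM than for tree-based models, and wrapping the scaler and classifier in a pipeline ensures the scaler is fitted on the training data only. The sketch below uses synthetic data; the RBF kernel and default hyperparameters are assumptions for illustration, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption); swap in the real train/test split
X_demo, y_demo = make_classification(
    n_samples=500, n_features=7, n_informative=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# The pipeline standardises the features, then fits the SVM on the result
svm_pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
svm_pipeline.fit(X_tr, y_tr)

# score() applies the already fitted scaler to the test data before predicting
test_accuracy = svm_pipeline.score(X_te, y_te)
print("Test accuracy:", round(test_accuracy, 3))
```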

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
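
The save-and-reload steps could look roughly like the sketch below, which uses `joblib`. It trains a random forest on synthetic data as a stand-in for the best model; the file name `apple_quality_model.joblib` and the temporary directory are hypothetical choices made for the example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data and model (assumption); use the real best model instead
X_demo, y_demo = make_classification(
    n_samples=300, n_features=7, n_informative=4, random_state=0
)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Save the fitted model to a hypothetical path inside a temporary directory
model_path = os.path.join(tempfile.mkdtemp(), "apple_quality_model.joblib")
joblib.dump(model, model_path)

# Load the model back and check that it predicts the same labels
loaded = joblib.load(model_path)
same = np.array_equal(model.predict(X_demo[:10]), loaded.predict(X_demo[:10]))
print("Predictions match after reload:", same)
```

A saved model should be loaded with the same scikit-learn version it was trained with, and any scaler or encoder used in training needs to be saved alongside it.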

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***