## Business Problem

This project uses machine learning to help apple growers, distributors, and retailers assess apple quality efficiently. Predicting quality from measurable attributes can streamline grading, reduce waste, and improve customer satisfaction across the supply chain.

## Dataset

This dataset describes a set of apples by their measurable attributes. Its columns are an apple ID (`A_id`), size, weight, sweetness, crunchiness, juiciness, ripeness, acidity, and a quality label (`good` or `bad`).

The dataset was generously provided by an American agriculture company. The data has been scaled and cleaned for ease of use.

[Apple Quality: Kaggle](https://www.kaggle.com/datasets/nelgiriyewithana/apple-quality)

In [1]:
import pandas as pd
import zipfile

In [3]:
with zipfile.ZipFile("data.zip", "r") as zip_ref:
    zip_ref.extractall("data")

# Read the extracted CSV file into a DataFrame
data = pd.read_csv("data/apple_quality.csv")

# Display the first few rows of the DataFrame
print(data.head())

   A_id      Size    Weight  Sweetness  Crunchiness  Juiciness  Ripeness  \
0   0.0 -3.970049 -2.512336   5.346330    -1.012009   1.844900  0.329840   
1   1.0 -1.195217 -2.839257   3.664059     1.588232   0.853286  0.867530   
2   2.0 -0.292024 -1.351282  -1.738429    -0.342616   2.838636 -0.038033   
3   3.0 -0.657196 -2.271627   1.324874    -0.097875   3.637970 -3.413761   
4   4.0  1.364217 -1.296612  -0.384658    -0.553006   3.030874 -1.303849   

        Acidity Quality  
0  -0.491590483    good  
1  -0.722809367    good  
2   2.621636473     bad  
3   0.790723217    good  
4   0.501984036    good  


In [9]:
num_rows, num_columns = data.shape
print(f"number of rows: {num_rows}")    # requirement: more than 1000 rows
print(f"number of cols: {num_columns}") # requirement: roughly 10 columns

null_counts = data.isnull().sum()
null_counts

number of rows: 4001
number of cols: 9


A_id           1
Size           1
Weight         1
Sweetness      1
Crunchiness    1
Juiciness      1
Ripeness       1
Acidity        0
Quality        1
dtype: int64
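
The counts above show exactly one missing value in every column except Acidity, which hints that a single row may be responsible. A quick way to check is to pull out the rows that contain any null. The sketch below wraps that in a hypothetical helper, `rows_with_nulls`, and runs it on a small made-up frame (`toy`) so that it can be executed on its own; in the notebook it would be called on `data` instead.

```python
import pandas as pd


def rows_with_nulls(df: pd.DataFrame) -> pd.DataFrame:
    """Return only the rows that contain at least one missing value."""
    return df[df.isnull().any(axis=1)]


# Made-up stand-in for the notebook's `data`; the values are illustrative only.
toy = pd.DataFrame(
    {
        "A_id": [0.0, 1.0, None],
        "Size": [-3.97, -1.20, None],
        "Acidity": ["-0.49", "-0.72", "some text"],
        "Quality": ["good", "good", None],
    }
)

incomplete = rows_with_nulls(toy)
print(incomplete)
```

If only one row comes back for the real data, dropping it with `dropna()` loses a single record and nothing else.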

In [10]:
# Drop every row that has at least one missing value
data_cleaned = data.dropna()

# Print the number of rows in the cleaned dataset
num_rows_cleaned = len(data_cleaned)
print("Number of rows after dropping null values:", num_rows_cleaned)

Number of rows after dropping null values: 4000
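
In the `head()` output earlier, Acidity is printed with more digits than the other numeric columns, which suggests pandas may have read it as text rather than as floats. That is an inference from the printed output, not something the notebook confirms, so it is worth inspecting `data_cleaned.dtypes` first. If the column does turn out to be text, `pd.to_numeric` can convert it. The helper name `coerce_numeric` and the `toy` frame below are made up for illustration.

```python
import pandas as pd


def coerce_numeric(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a copy of df with `column` converted to a numeric dtype.

    Values that cannot be parsed become NaN instead of raising an error.
    """
    out = df.copy()
    out[column] = pd.to_numeric(out[column], errors="coerce")
    return out


# Made-up stand-in for the notebook's `data_cleaned`.
toy = pd.DataFrame(
    {
        "Acidity": ["-0.491590483", "2.621636473", "some text"],
        "Quality": ["good", "bad", "good"],
    }
)

converted = coerce_numeric(toy, "Acidity")
print(converted.dtypes)
print("Unparseable Acidity values:", converted["Acidity"].isna().sum())
```

Using `errors="coerce"` makes any unparseable entry visible as NaN, so a second null check after the conversion shows whether further rows need attention.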
