# Milk Quality Prediction
### Objectives
Techniques Used
- Data Cleaning
- Data Visualization
- Machine Learning Modeling

Algorithms Used
- KNN

Model Evaluation Methods Used
- Accuracy Score
<hr>

Let's load the required libraries.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn.preprocessing as pp
%matplotlib inline

### Load Data From CSV File

In [None]:
data = pd.read_csv('data.csv')
data.head()

### Data Visualization and Analysis
Let's look at summary statistics and see how many samples of each class are in our data set.

In [None]:
data.describe()

In [None]:
data['Grade'].value_counts()

The data set contains 429 low-quality, 374 medium-quality, and 256 high-quality samples.
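The class counts above also give a useful reference point: the share of each class, and the accuracy a model would get by always predicting the most common class. The sketch below rebuilds the counts by hand rather than reading `data.csv`; the index labels are illustrative assumptions and may not match the exact strings stored in the `Grade` column.

```python
import pandas as pd

# Counts copied from the value_counts() result reported above.
# The index labels are assumed names, not necessarily the raw 'Grade' values.
counts = pd.Series({'low': 429, 'medium': 374, 'high': 256})

total = counts.sum()
proportions = counts / total

print(total)
print(proportions.round(3))

# Accuracy of a trivial model that always predicts the majority class
majority_baseline = proportions.max()
print(round(majority_baseline, 3))
```

Any classifier worth keeping should score clearly above this majority-class baseline.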

In [None]:
data.hist(column='Temprature', bins=50)

### Feature set
Let's convert the pandas DataFrame columns to NumPy arrays:

In [None]:
X = data[['pH','Temprature','Taste','Odor','Fat','Turbidity','Color']].values
y = data['Grade'].values  # 1-D label array, the shape scikit-learn classifiers expect
X[0:5]

### Normalize Data
Standardizing data, which involves giving the data a zero mean and unit variance, is a good practice to follow, especially for algorithms that rely on the distance between data points, such as KNN.

In [None]:
X = pp.StandardScaler().fit(X).transform(X.astype(float))
X[0:5]
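To make "zero mean and unit variance" concrete, here is a minimal sketch on a small made-up matrix (the values are hypothetical and not taken from `data.csv`). It checks that `StandardScaler` matches the column-wise formula z = (x - mean) / std.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix with two columns on very different scales (made-up values)
toy = np.array([[6.6, 35.0],
                [6.8, 45.0],
                [9.5, 70.0],
                [3.0, 40.0]])

scaled = StandardScaler().fit_transform(toy)

# Same transformation by hand: subtract each column's mean and divide by its
# population standard deviation (NumPy's default, ddof=0)
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6))
print(scaled.std(axis=0).round(6))
```

After scaling, both columns contribute on a comparable scale to the Euclidean distances that KNN relies on.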

### Train Test Split
Out-of-sample accuracy is the percentage of correct predictions the model makes on data it has not been trained on. If a model is trained and tested on the same data set, the measured accuracy will look optimistic, while the true out-of-sample accuracy will likely be low due to overfitting. A train/test split gives a more honest estimate. This method divides the data set into mutually exclusive training and testing sets: the model is trained on the training set and tested on the testing set, which provides a more realistic evaluation of out-of-sample accuracy for real-world problems.

In [None]:
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=4)
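As a quick illustration of what "mutually exclusive" means here, the following sketch splits a small synthetic array (a stand-in, not the milk data) with the same `test_size` and `random_state` and checks that no sample lands in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 50 samples whose single feature is the row index,
# so each sample can be identified after the split
X_demo = np.arange(50).reshape(-1, 1)
y_demo = np.array([0, 1] * 25)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=4
)

print(X_tr.shape, X_te.shape)

# No row index appears in both the training and the testing set
overlap = set(X_tr.ravel()) & set(X_te.ravel())
print(len(overlap))
```

With `test_size=0.2`, 20% of the rows are held out for testing and the remaining 80% are used for training.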

### Classification
##### K nearest neighbor (KNN)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
k = 3  # number of neighbors that vote on each prediction
kcls = KNeighborsClassifier(n_neighbors=k)
kcls.fit(train_x, np.ravel(train_y))  # ravel ensures the labels are 1-D

### Accuracy for other values of K
To choose a good value of K, part of the data must be set aside for evaluating the model. Starting from K=1, fit the model, compute its accuracy on the held-out data, and repeat for increasing values of K. The K with the highest accuracy is then selected.

In [None]:
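from sklearn.metrics import accuracy_score
Ks = 10  # evaluate K = 1 .. Ks-1
mean_acc = np.zeros(Ks - 1)
std_acc = np.zeros(Ks - 1)
y_train_1d = np.ravel(train_y)  # flatten labels to 1-D
y_test_1d = np.ravel(test_y)
for n in range(1, Ks):
    neigh = KNeighborsClassifier(n_neighbors=n).fit(train_x, y_train_1d)
    yhat = neigh.predict(test_x)
    mean_acc[n-1] = accuracy_score(y_test_1d, yhat)
    # standard error of the accuracy; 1-D arrays avoid (n, 1) vs (n,) broadcasting
    std_acc[n-1] = np.std(yhat == y_test_1d) / np.sqrt(yhat.shape[0])

mean_acc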
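The loop above picks K using the test set, so the accuracy reported for the chosen K may be somewhat optimistic. One common alternative, shown here only as a sketch and not as part of the original notebook, is to select K with k-fold cross-validation on the training data. The example uses synthetic data from `make_classification` as a stand-in; the sample count, feature count, and class count are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class stand-in for the training data (not the milk data set)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=7, n_informative=4, n_classes=3, random_state=4
)

# Mean 5-fold cross-validated accuracy for K = 1 .. 9
cv_acc = {}
for k in range(1, 10):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_demo, y_demo, cv=5)
    cv_acc[k] = scores.mean()

best_k = max(cv_acc, key=cv_acc.get)
print(best_k, round(cv_acc[best_k], 3))
```

With this approach the test set is touched only once, after K has been fixed, so its accuracy remains an unbiased estimate.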

### Plot for other K

In [None]:
plt.plot(range(1, Ks), mean_acc, 'b--')
plt.fill_between(range(1, Ks), mean_acc - 1 * std_acc, mean_acc + 1 * std_acc, alpha=0.10)
plt.fill_between(range(1, Ks), mean_acc - 3 * std_acc, mean_acc + 3 * std_acc, alpha=0.10, color="blue")
plt.legend(('Accuracy', '+/- 1xstd', '+/- 3xstd'))
plt.ylabel('Accuracy')
plt.xlabel('Number of Neighbors (K)')
plt.tight_layout()
plt.show()

print("The best accuracy was", mean_acc.max(), "with k =", mean_acc.argmax() + 1)

### Accuracy evaluation
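The `accuracy_score` function measures how closely the predicted labels match the actual labels. In multilabel classification it computes subset accuracy, where the full set of labels predicted for a sample must match exactly. For a single-label, multiclass problem like this one, it is simply the fraction of samples whose predicted grade equals the true grade. Note that this is not the same quantity as `jaccard_score` in current versions of scikit-learn. Comparing the score on the training set with the score on the test set also gives a rough indication of overfitting.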

In [None]:
train_y_hat = kcls.predict(train_x)
test_y_hat = kcls.predict(test_x)
print('train accuracy score:' , accuracy_score(train_y,train_y_hat))
print('test accuracy score:' , accuracy_score(test_y,test_y_hat))
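For a single-label problem, the accuracy score reduces to simple arithmetic: the number of correct predictions divided by the number of samples. The sketch below checks this on a handful of made-up labels (hypothetical values, not model output from this notebook).

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up true and predicted grades for five samples
y_true = np.array(['high', 'low', 'medium', 'low', 'high'])
y_pred = np.array(['high', 'low', 'low', 'low', 'medium'])

# Fraction of positions where the prediction equals the true label (3 of 5 here)
manual_acc = np.mean(y_true == y_pred)
sk_acc = accuracy_score(y_true, y_pred)

print(manual_acc, sk_acc)
```

Both values agree, which confirms that `accuracy_score` is the plain fraction of correct predictions in this setting.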

### Author
Sahand Sabet (https://github.com/sahandsbt)
<hr>
<h3 align="center"> © Sa-S.ir 2023. All rights reserved. </h3>