# WINE CLASSIFIER MODEL
**The Question**

We often search online for reviews before buying or trying a product. With that habit in mind, we want to make the user's life a bit easier by using a classifier to decide how a wine should be rated. 
Assume we have a collection of wine reviews. For each one, we want to predict whether the review is favourable or not. Put differently, we want a model that takes a review and predicts whether a particular wine is good or bad. (More generally, we want a model that takes a document d and predicts a class.)

**The Approach**

We use the Naive Bayes algorithm to classify the wines. 

**Hypothesis**

This algorithm should prove fairly accurate; however, we do not expect one hundred percent accuracy.

**What is Machine Learning**

Machine learning (ML) is a branch of artificial intelligence (AI) that allows software applications to improve their prediction accuracy without being explicitly programmed to do so. Machine learning algorithms use historical data as input to predict new output values.

Machine learning covers three kinds of task: classification, prediction, and regression. The first two are described here.

Classification: To automate a classification process, the algorithm's job is to produce a recipe that separates the data. 
Suppose we have a data set labeled with two classes (Y and N). The algorithm must place a fence in the data that best separates the Y's from the N's. Our goal is a model that correctly classifies new cases. The algorithm determines what shape is fitted to the data, and the data determines where the fence is best placed.

Prediction: If classification is the process of sorting data into groups, prediction is the process of fitting a shape that runs as close to the data as possible. The shape we fit resembles a skeleton running through a single body of data rather than a fence separating two bodies of data. 
As before, the algorithm provides the WHAT, while the data provides the WHERE. New data points are plugged into the fitted formula, and the predicted value is read from the line.
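As a minimal sketch of the prediction idea, the snippet below fits a straight line to a handful of points by ordinary least squares and then reads a new value off the line. The helper names `fit_line` and `predict_line` and the data points are hypothetical and chosen only for illustration; they are not part of this notebook's classifier.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean, and co-variation of x with y
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_line(x, slope, intercept):
    """Read the predicted value for x from the fitted line."""
    return slope * x + intercept


# Hypothetical points that lie exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

slope, intercept = fit_line(xs, ys)          # the data decides WHERE the line goes
new_value = predict_line(4.0, slope, intercept)
print(slope, intercept, new_value)
```

Here the choice of a straight line is the WHAT, and the fitted slope and intercept are the WHERE.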

In [None]:
import numpy as np

In [None]:
from csv import reader
import random

def load_csv(filename):
    dataset = list()
    with open(filename, 'r') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            if not row:
                continue
            dataset.append(row)
    return dataset

df = load_csv('/content/wine-dataset.data')
len(df)

178

In [None]:
# Converting strings to floats
for i in range(len(df)):
    for j in range(len(df[i])):
        df[i][j]=float(df[i][j])
# Check that the conversion worked
type(df[0][0])

float

In [None]:
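# Note: df[:145] ends at row 144 and df[146:] starts at row 146,
# so row 145 lands in neither split (145 + 32 = 177 of the 178 rows).
train, test = df[:145], df[146:]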

In [None]:
# len(df)

In [None]:
# idx = random.randrange(len(df))
# idx

In [None]:
# def split_data(data, weight):
#     train_length = int(len(data) * weight)
#     train = []
#     for i in range(train_length):
#         idx = random.randrange(len(data))
#         train.append(data[idx])
#         data.pop(idx)
#     return [train, data]

# train, test = split_data(df, 0.7)

In [None]:
X_train = []
y_train = []
X_test = []
y_test = []

for i in range(len(train)):
    y_train.append(train[i][0])
    X_train.append(train[i][1:])
    
for i in range(len(test)):
    y_test.append(test[i][0])
    X_test.append(test[i][1:])

In [None]:
len(X_train), len(y_train), len(X_test), len(y_test)

(145, 145, 32, 32)
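The split above takes the first 145 rows for training and the last 32 for testing, in file order. If the file is sorted by class label, as the UCI wine data commonly is, the test set ends up dominated by a single class. A hedged alternative is to shuffle before cutting, in the spirit of the commented-out `split_data` cell. The function name `shuffled_split`, the seed, and the toy rows below are assumptions for illustration only.

```python
import random


def shuffled_split(rows, train_fraction, seed=0):
    """Shuffle a copy of rows and cut it into (train, test) without losing any row."""
    rng = random.Random(seed)      # seeded so the split is reproducible
    shuffled = list(rows)          # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical rows: [label, feature], sorted by label like a class-ordered file
rows = [[float(label), float(i)]
        for i, label in enumerate([1] * 6 + [2] * 6 + [3] * 6)]

train_rows, test_rows = shuffled_split(rows, 0.7, seed=42)
print(len(train_rows), len(test_rows))
```

Because both slices use the same `cut` index, every row lands in exactly one of the two sets.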

**Naive Bayes Algorithm**

It is a classification method based on Bayes' Theorem together with the assumption that the predictors are independent. In simple terms, a Naive Bayes classifier assumes that, within a class, the value of one feature is unrelated to the value of any other feature.
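Written out, with $c$ for a class and $x_1, \dots, x_n$ for the feature values of one sample (notation assumed here, not taken from the notebook), Bayes' Theorem and the independence assumption combine as:

$$
P(c \mid x_1, \dots, x_n) = \frac{P(c)\, P(x_1, \dots, x_n \mid c)}{P(x_1, \dots, x_n)}, \qquad P(x_1, \dots, x_n \mid c) = \prod_{i=1}^{n} P(x_i \mid c)
$$

The second equality is the "naive" part: the joint likelihood of all features is replaced by a product of per-feature likelihoods.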

Steps to use the Naive Bayes classifier: 

1. Load required libraries- NumPy is the sole library you'll need to create your own Naive Bayes classifier. NumPy is an open source project that aims to make numerical computation possible with Python, and we'll be using it to do arithmetic operations. 

2. Instantiate the class- The next step is to create a new instance of our Naive Bayes classifier. A class functions similarly to an object constructor or a "blueprint" for constructing things. Almost everything in an object-oriented programming language is an object, complete with properties and methods. 

3. Separate Classes- Bayes' Theorem requires the prior probability of each class before we can predict it. To compute it, we first assign the feature values to the appropriate class. This can be accomplished by splitting the data by class and storing the groups in a dictionary. 
Dictionaries are Python's implementation of an associative array, a data structure made up of key-value pairs in which each key maps to its associated value.

4. Summary of Features- The likelihood, or the probability of a predictor given a class, is assumed to follow a normal (Gaussian) distribution and is computed from the mean and standard deviation. We'll produce a summary for each feature in the data set, which makes it easier to retrieve the mean and standard deviation of features later. 

5. The Gaussian distribution function is used to calculate the likelihood of features that follow a normal distribution.

6. Train the model- Training the model entails applying it to a dataset so that it can iterate through it and learn the dataset's patterns. The mean and standard deviation for each feature of each class are calculated in the Naive Bayes classifier during training. This will enable us to determine the probabilities that will be utilized to make forecasts. 

7. Predict- In order to predict a class, we must first calculate the posterior probability of each class. The predicted class is the one with the highest posterior probability. 
The posterior probability is calculated by dividing the joint probability by the marginal probability. The denominator, or marginal probability, is the sum of the joint probabilities of all classes, so it is the same for every class. The class with the highest posterior probability is therefore also the class with the highest joint probability, and the division can be skipped.
Predict the class- Once we have the joint probability for each class, we choose the class with the highest joint probability. 
Putting it all together- We can predict the class for each row in a test data set by combining the joint probability and predict class steps.

8. Calculating the accuracy score is an important aspect of any machine learning model's testing. To evaluate the performance of our Naive Bayes classifier, we divide the number of right predictions by the total number of predictions, yielding a number ranging from 0 to 1.
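Steps 4 to 7 can be summarised in two formulas. The symbols are assumptions introduced here for illustration: $x_i$ is the value of feature $i$ in the row being classified, $\mu_{c,i}$ and $\sigma_{c,i}$ are the mean and standard deviation of feature $i$ within class $c$, and $P(c)$ is the prior probability of class $c$.

$$
P(x_i \mid c) = \frac{1}{\sqrt{2\pi}\,\sigma_{c,i}} \exp\!\left(-\frac{(x_i - \mu_{c,i})^2}{2\sigma_{c,i}^2}\right), \qquad \hat{c} = \arg\max_{c}\; P(c) \prod_{i=1}^{n} P(x_i \mid c)
$$

The first expression is the Gaussian likelihood of a single feature; the second picks the class with the largest joint probability, which is what the `predict` method below computes.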

In [None]:
class NaiveBayesClassifier:

    def __init__(self):
        pass

    def separate_classes(self, X, y):
        """
        Separates the dataset into a subset of data for each class.
        Parameters:
        ------------
        X- array, list of features
        y- list, target
        Returns:
        A dictionary with y as keys and assigned X as values.
        """
        separated_classes = {}
        for i in range(len(X)):
            feature_values = X[i]
            class_name = y[i]
            if class_name not in separated_classes:
                separated_classes[class_name] = []
            separated_classes[class_name].append(feature_values)
        return separated_classes

    def stat_info(self, X):
        """
        Calculates standard deviation and mean of features.
        Parameters:
        ------------
        X- array , list of features
        Yields:
        One dictionary per feature, with 'std' and 'mean' as keys and that feature's standard deviation and mean as values.
        """
        for feature in zip(*X):
            yield {
                'std' : np.std(feature),
                'mean' : np.mean(feature)
            }
    
    def fit (self, X, y):
        """
        Trains the model.
        Parameters:
        ----------
        X: array-like, training features
        y: list, target variable
        Returns:
        Dictionary with the prior probability, mean, and standard deviation of each class
        """

        separated_classes = self.separate_classes(X, y)
        self.class_summary = {}

        for class_name, feature_values in separated_classes.items():
            self.class_summary[class_name] = {
                'prior_proba': len(feature_values)/len(X),
                'summary': [i for i in self.stat_info(feature_values)],
            }
        return self.class_summary

    def distribution(self, x, mean, std):
        """
        Gaussian Distribution Function
        Parameters:
        ----------
        x: float, value of feature
        mean: float, the average value of feature
        std: float, the standard deviation of feature
        Returns:
        A value of Normal Probability
        """

        exponent = np.exp(-((x-mean)**2 / (2*std**2)))

        return exponent / (np.sqrt(2*np.pi)*std)

    def predict(self, X):
        """
        Predicts the class.
        Parameters:
        ----------
        X: array-like, test data set
        Returns:
        -----------
        List of predicted class for each row of data set
        """
        MAPs = []

        for row in X:
            joint_proba = {}
            
            for class_name, features in self.class_summary.items():
                total_features =  len(features['summary'])
                likelihood = 1

                for idx in range(total_features):
                    feature = row[idx]
                    mean = features['summary'][idx]['mean']
                    stdev = features['summary'][idx]['std']
                    normal_proba = self.distribution(feature, mean, stdev)
                    likelihood *= normal_proba
                prior_proba = features['prior_proba']
                joint_proba[class_name] = prior_proba * likelihood

            MAP = max(joint_proba, key= joint_proba.get)
            MAPs.append(MAP)

        return MAPs

    def accuracy(self, y_test, y_pred):
        """
        Calculates model's accuracy.
        Parameters:
        ------------
        y_test: actual values
        y_pred: predicted values
        Returns:
        ------------
        A number between 0 and 1, representing the fraction of correct predictions.
        """

        true_true = 0.0

        for y_t, y_p in zip(y_test, y_pred):
            if y_t == y_p:
                true_true += 1 
        print(len(y_test), true_true)
        return true_true / len(y_test)

In [None]:
model = NaiveBayesClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
acc = model.accuracy(y_test, y_pred)
print ("NaiveBayesClassifier accuracy: {0:.6f}".format(model.accuracy(y_test, y_pred)))

32 30.0
32 30.0
NaiveBayesClassifier accuracy: 0.937500
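The `predict` method multiplies one density per feature, and a product of many small numbers can underflow to zero when there are many features or very narrow distributions. A common remedy is to add log-probabilities instead, which leaves the winning class unchanged because the logarithm is monotonic. The sketch below is not the notebook's method: `log_gaussian`, `predict_log`, and the toy `summary` are hypothetical names and values, and the dictionary only mirrors the layout that `fit` returns.

```python
import math


def log_gaussian(x, mean, std):
    """Natural log of the Gaussian density at x."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def predict_log(class_summary, rows):
    """Pick, for each row, the class with the highest log joint probability."""
    predictions = []
    for row in rows:
        scores = {}
        for class_name, info in class_summary.items():
            # log prior + sum of per-feature log likelihoods
            score = math.log(info['prior_proba'])
            for value, stats in zip(row, info['summary']):
                score += log_gaussian(value, stats['mean'], stats['std'])
            scores[class_name] = score
        predictions.append(max(scores, key=scores.get))
    return predictions


# Hypothetical one-feature summary in the same shape as model.class_summary
summary = {
    1.0: {'prior_proba': 0.5, 'summary': [{'mean': 0.0, 'std': 1.0}]},
    2.0: {'prior_proba': 0.5, 'summary': [{'mean': 5.0, 'std': 1.0}]},
}

labels = predict_log(summary, [[0.2], [4.9]])
print(labels)
```

Like the original `distribution` method, this sketch assumes every standard deviation is strictly positive; a feature that is constant within a class would need separate handling.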


In [None]:
type(acc)

float

In [None]:
for p, a in zip(y_pred, y_test):
  if (p != a):
    print(p, a)

1.0 3.0
1.0 3.0
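Printing mismatches one by one works for a small test set. A slightly more structured view is to count every (actual, predicted) pair, which is what a confusion matrix tabulates. The sketch below uses hypothetical labels rather than the notebook's results, and `confusion_counts` is an illustrative helper name.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(y_true, y_pred))


# Hypothetical labels, for illustration only
y_true_demo = [3.0, 3.0, 3.0, 2.0]
y_pred_demo = [3.0, 1.0, 1.0, 2.0]

counts = confusion_counts(y_true_demo, y_pred_demo)
# Keep only the pairs where the prediction disagrees with the actual label
errors = {pair: n for pair, n in counts.items() if pair[0] != pair[1]}
print(errors)
```

Pairs on the diagonal (actual equals predicted) are correct predictions; everything else shows which class is being confused with which.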


In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

sk_model = GaussianNB()
sk_model.fit(X_train, y_train)
y_pred = sk_model.predict(X_test)
print("Scikit-learn GaussianNB accuracy: {0:.6f}".format(accuracy_score(y_test, y_pred)))

Scikit-learn GaussianNB accuracy: 0.937500


**Conclusion** 

This sums up the project and the work demonstrated through the classifier system. The main challenges we faced during the project were the coding and finding a dataset that suited our problem statement. 
The project was very insightful. After detailed study, the entire group agreed that classification systems are among the most expensive technologies for the foreseeable future, and we decided to keep working on this one, making noteworthy changes that could turn our study into potential research someday.

 Hoping for the best! 