# **Wine Classification**

-------------

## **Objective**

The objective is to develop a classification model that assigns each wine to its type. PySpark, a framework designed for large-scale data processing, is used to prepare the data and train the model. The project aims to predict the type of wine accurately from input features such as alcohol percentage, malic acid, and alkalinity of ash.

## **Data Source**

https://github.com/YBIFoundation/Dataset/raw/main/Wine.csv

In [None]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.4.1.tar.gz (310.8 MB)
Building wheels for collected packages: pyspark
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.4.1


## **Import Library**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pyspark

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local[*]').getOrCreate()

In [None]:
spark

## **Import Data**

In [None]:
df = pd.read_csv('https://github.com/YBIFoundation/Dataset/raw/main/Wine.csv')
spdf = spark.createDataFrame(df)

In [None]:
spdf

DataFrame[class_label: bigint, class_name: string, alcohol: double, malic_acid: double, ash: double, alcalinity_of_ash: double, magnesium: bigint, total_phenols: double, flavanoids: double, nonflavanoid_phenols: double, proanthocyanins: double, color_intensity: double, hue: double, od280: double, proline: bigint]

In [None]:
spdf.printSchema()

root
 |-- class_label: long (nullable = true)
 |-- class_name: string (nullable = true)
 |-- alcohol: double (nullable = true)
 |-- malic_acid: double (nullable = true)
 |-- ash: double (nullable = true)
 |-- alcalinity_of_ash: double (nullable = true)
 |-- magnesium: long (nullable = true)
 |-- total_phenols: double (nullable = true)
 |-- flavanoids: double (nullable = true)
 |-- nonflavanoid_phenols: double (nullable = true)
 |-- proanthocyanins: double (nullable = true)
 |-- color_intensity: double (nullable = true)
 |-- hue: double (nullable = true)
 |-- od280: double (nullable = true)
 |-- proline: long (nullable = true)



In [None]:
spdf.show()

+-----------+----------+-------+----------+----+-----------------+---------+-------------+----------+--------------------+---------------+---------------+----+-----+-------+
|class_label|class_name|alcohol|malic_acid| ash|alcalinity_of_ash|magnesium|total_phenols|flavanoids|nonflavanoid_phenols|proanthocyanins|color_intensity| hue|od280|proline|
+-----------+----------+-------+----------+----+-----------------+---------+-------------+----------+--------------------+---------------+---------------+----+-----+-------+
|          1|    Barolo|  14.23|      1.71|2.43|             15.6|      127|          2.8|      3.06|                0.28|           2.29|           5.64|1.04| 3.92|   1065|
|          1|    Barolo|   13.2|      1.78|2.14|             11.2|      100|         2.65|      2.76|                0.26|           1.28|           4.38|1.05|  3.4|   1050|
|          1|    Barolo|  13.16|      2.36|2.67|             18.6|      101|          2.8|      3.24|                 0.3|        

## **Describe Data**

In [None]:
spdf.describe().show()

+-------+------------------+----------+------------------+------------------+------------------+------------------+-----------------+------------------+------------------+--------------------+------------------+------------------+-------------------+------------------+------------------+
|summary|       class_label|class_name|           alcohol|        malic_acid|               ash| alcalinity_of_ash|        magnesium|     total_phenols|        flavanoids|nonflavanoid_phenols|   proanthocyanins|   color_intensity|                hue|             od280|           proline|
+-------+------------------+----------+------------------+------------------+------------------+------------------+-----------------+------------------+------------------+--------------------+------------------+------------------+-------------------+------------------+------------------+
|  count|               178|       178|               178|               178|               178|               178|              178|

## **Data Preprocessing**

In [None]:
spdf.columns

['class_label',
 'class_name',
 'alcohol',
 'malic_acid',
 'ash',
 'alcalinity_of_ash',
 'magnesium',
 'total_phenols',
 'flavanoids',
 'nonflavanoid_phenols',
 'proanthocyanins',
 'color_intensity',
 'hue',
 'od280',
 'proline']

In [None]:
from pyspark.ml.feature import VectorAssembler

In [None]:
featureassembler = VectorAssembler(inputCols=['alcohol',
 'malic_acid',
 'ash',
 'alcalinity_of_ash',
 'magnesium',
 'total_phenols',
 'flavanoids',
 'nonflavanoid_phenols',
 'proanthocyanins',
 'color_intensity',
 'hue',
 'od280',
 'proline'], outputCol='Features')

featureassembler

VectorAssembler_8b7e7f215bc0
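
What the assembler does to each row can be pictured in plain Python: it takes the listed input columns, in order, and packs their values into a single numeric vector. The sketch below is only an illustration built from the first row printed by `spdf.show()`; the helper `assemble_features` is hypothetical and is not part of PySpark.

```python
# Feature columns, in the same order as the VectorAssembler's inputCols.
FEATURE_COLS = [
    "alcohol", "malic_acid", "ash", "alcalinity_of_ash", "magnesium",
    "total_phenols", "flavanoids", "nonflavanoid_phenols", "proanthocyanins",
    "color_intensity", "hue", "od280", "proline",
]


def assemble_features(row, input_cols):
    """Hypothetical helper: pack the chosen columns of one row into a list of floats."""
    return [float(row[col]) for col in input_cols]


# First row of the dataset, copied from the spdf.show() output.
first_row = {
    "class_label": 1, "class_name": "Barolo",
    "alcohol": 14.23, "malic_acid": 1.71, "ash": 2.43,
    "alcalinity_of_ash": 15.6, "magnesium": 127, "total_phenols": 2.8,
    "flavanoids": 3.06, "nonflavanoid_phenols": 0.28, "proanthocyanins": 2.29,
    "color_intensity": 5.64, "hue": 1.04, "od280": 3.92, "proline": 1065,
}

features = assemble_features(first_row, FEATURE_COLS)
print(features[:3])  # [14.23, 1.71, 2.43]
```

The label columns (`class_label`, `class_name`) are deliberately left out of the vector, so the model never sees the answer as an input.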

## **Define Target Variable (y) and Feature Variables (X)**

In [None]:
modeldata = featureassembler.transform(spdf).select('Features','class_label')
modeldata.show()

+--------------------+-----------+
|            Features|class_label|
+--------------------+-----------+
|[14.23,1.71,2.43,...|          1|
|[13.2,1.78,2.14,1...|          1|
|[13.16,2.36,2.67,...|          1|
|[14.37,1.95,2.5,1...|          1|
|[13.24,2.59,2.87,...|          1|
|[14.2,1.76,2.45,1...|          1|
|[14.39,1.87,2.45,...|          1|
|[14.06,2.15,2.61,...|          1|
|[14.83,1.64,2.17,...|          1|
|[13.86,1.35,2.27,...|          1|
|[14.1,2.16,2.3,18...|          1|
|[14.12,1.48,2.32,...|          1|
|[13.75,1.73,2.41,...|          1|
|[14.75,1.73,2.39,...|          1|
|[14.38,1.87,2.38,...|          1|
|[13.63,1.81,2.7,1...|          1|
|[14.3,1.92,2.72,2...|          1|
|[13.83,1.57,2.62,...|          1|
|[14.19,1.59,2.48,...|          1|
|[13.64,3.1,2.56,1...|          1|
+--------------------+-----------+
only showing top 20 rows



## **Train Test Split**

In [None]:
train_data, test_data = modeldata.randomSplit([0.8,0.2])
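
A random split of this kind assigns rows by sampling rather than by slicing off exact fractions, so the resulting sizes are only approximately 80/20. In this run the test set ended up with 33 of the 178 rows. The plain-Python sketch below illustrates the idea with a fixed seed; `random_split` is a hypothetical stand-in for illustration, not Spark's implementation.

```python
import random


def random_split(rows, weights, seed):
    """Hypothetical sketch: assign each row to a partition at random, by weight."""
    total = sum(weights)
    # Cumulative upper bounds of each partition on the interval [0, 1).
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)

    rng = random.Random(seed)  # fixed seed makes the split repeatable
    parts = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        # First partition whose bound exceeds u; fall back to the last one.
        idx = next((i for i, b in enumerate(bounds) if u < b), len(bounds) - 1)
        parts[idx].append(row)
    return parts


rows = list(range(178))  # stand-in for the 178 wine rows
train_rows, test_rows = random_split(rows, [0.8, 0.2], seed=42)
print(len(train_rows), len(test_rows))
```

In PySpark itself, `randomSplit` accepts an optional `seed` argument, for example `modeldata.randomSplit([0.8, 0.2], seed=42)`, which helps make a run repeatable. The outputs shown in this notebook come from an unseeded split, so a rerun will generally produce different numbers.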

## **Modelling**

In [None]:
from pyspark.ml.classification import DecisionTreeClassifier

In [None]:
dt = DecisionTreeClassifier(featuresCol='Features',labelCol='class_label')

In [None]:
dt = dt.fit(train_data)

## **Prediction**

In [None]:
y_pred = dt.transform(test_data)

In [None]:
y_pred.show()

+--------------------+-----------+------------------+-----------------+----------+
|            Features|class_label|     rawPrediction|      probability|prediction|
+--------------------+-----------+------------------+-----------------+----------+
|[11.65,1.67,2.62,...|          2|[0.0,0.0,47.0,0.0]|[0.0,0.0,1.0,0.0]|       2.0|
|[11.84,0.89,2.58,...|          2|[0.0,0.0,47.0,0.0]|[0.0,0.0,1.0,0.0]|       2.0|
|[12.17,1.45,2.53,...|          2|[0.0,0.0,47.0,0.0]|[0.0,0.0,1.0,0.0]|       2.0|
|[12.33,1.1,2.28,1...|          2| [0.0,0.0,2.0,0.0]|[0.0,0.0,1.0,0.0]|       2.0|
|[12.64,1.36,2.02,...|          2|[0.0,0.0,0.0,29.0]|[0.0,0.0,0.0,1.0]|       3.0|
|[12.67,0.98,2.24,...|          2|[0.0,0.0,47.0,0.0]|[0.0,0.0,1.0,0.0]|       2.0|
|[13.29,1.97,2.68,...|          1|[0.0,52.0,0.0,0.0]|[0.0,1.0,0.0,0.0]|       1.0|
|[13.34,0.94,2.36,...|          2| [0.0,0.0,0.0,1.0]|[0.0,0.0,0.0,1.0]|       3.0|
|[13.41,3.84,2.12,...|          1|[0.0,52.0,0.0,0.0]|[0.0,1.0,0.0,0.0]|       1.0|
|[14
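
In the rows above, `probability` appears to be `rawPrediction` rescaled to sum to 1, and `prediction` is the index of the largest entry. The vectors have four entries although there are only three wine types, presumably because the labels run from 1 to 3 and index 0 goes unused. A plain-Python sketch of that relationship, using the first row of the table (the helper names are hypothetical):

```python
def normalize(raw):
    """Rescale raw per-class counts so that they sum to 1."""
    total = sum(raw)
    return [value / total for value in raw]


def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])


# rawPrediction of the first test row shown above.
raw_prediction = [0.0, 0.0, 47.0, 0.0]

probability = normalize(raw_prediction)
prediction = float(argmax(probability))
print(probability, prediction)  # [0.0, 0.0, 1.0, 0.0] 2.0
```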

In [None]:
y_pred.groupBy('class_label','prediction').count().show()

+-----------+----------+-----+
|class_label|prediction|count|
+-----------+----------+-----+
|          2|       2.0|   13|
|          1|       1.0|    5|
|          2|       3.0|    2|
|          3|       2.0|    2|
|          3|       3.0|   10|
|          2|       1.0|    1|
+-----------+----------+-----+



In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
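# Collect labels and predictions in one pass so both lists share the same row order
rows = y_pred.select('class_label', 'prediction').collect()
org = [row['class_label'] for row in rows]
pred = [row['prediction'] for row in rows]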

In [None]:
cm = confusion_matrix(org,pred)
cm

array([[ 5,  0,  0],
       [ 1, 13,  2],
       [ 0,  2, 10]])
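
The confusion matrix holds the same information as the `groupBy('class_label','prediction')` counts, arranged with actual classes as rows and predicted classes as columns. As a quick cross-check, the sketch below rebuilds the matrix from those counts in plain Python; the variable names are illustrative only.

```python
# (class_label, prediction) -> count, copied from the groupBy output.
counts = {
    (2, 2.0): 13,
    (1, 1.0): 5,
    (2, 3.0): 2,
    (3, 2.0): 2,
    (3, 3.0): 10,
    (2, 1.0): 1,
}

labels = [1, 2, 3]

# Rows are actual classes, columns are predicted classes; missing pairs count as 0.
cm_rows = [
    [counts.get((actual, float(predicted)), 0) for predicted in labels]
    for actual in labels
]

for row in cm_rows:
    print(row)
# [5, 0, 0]
# [1, 13, 2]
# [0, 2, 10]
```

Reading across the second row, 13 of the 16 wines with label 2 were predicted correctly, one was predicted as label 1, and two as label 3.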

## **Model Evaluation**


In [None]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [None]:
evaluator = MulticlassClassificationEvaluator(labelCol='class_label', predictionCol='prediction', metricName='weightedPrecision')

In [None]:
# The evaluator's metric is weighted precision, so name the result accordingly
weighted_precision = evaluator.evaluate(y_pred)

In [None]:
weighted_precision

0.8494949494949495
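
Because the evaluator is configured with `metricName='weightedPrecision'`, the value above is a weighted precision rather than plain accuracy. The two happen to be close here. The sketch below recomputes both from the confusion matrix in plain Python, assuming the usual definitions: accuracy is the share of correct predictions, and weighted precision is the per-class precision averaged with weights equal to each class's share of the true labels.

```python
# Confusion matrix from above: rows are actual classes, columns are predictions.
cm = [
    [5, 0, 0],
    [1, 13, 2],
    [0, 2, 10],
]
n_classes = len(cm)
n_total = sum(sum(row) for row in cm)  # 33 test rows

# Accuracy: correct predictions (the diagonal) over all predictions.
accuracy = sum(cm[i][i] for i in range(n_classes)) / n_total

# Per-class precision: correct predictions of a class over all predictions of it.
predicted_totals = [sum(cm[r][c] for r in range(n_classes)) for c in range(n_classes)]
precision = [cm[c][c] / predicted_totals[c] for c in range(n_classes)]

# Weight each class's precision by its share of the true labels.
actual_totals = [sum(row) for row in cm]
weighted_precision = sum(
    actual_totals[c] / n_total * precision[c] for c in range(n_classes)
)

print(round(accuracy, 4), round(weighted_precision, 4))  # 0.8485 0.8495
```

The recomputed weighted precision matches the evaluator's output, while accuracy works out to 28 correct predictions out of 33.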

In [None]:
spark.stop()

## **Explanation**

A decision tree classifier is trained on roughly 80% of the data and tested on the remaining 20% or so (33 of the 178 rows in this run). The confusion matrix and the evaluation metric help us assess the performance of the model. The metric reported by the evaluator is weighted precision, which is about 85% on the test set; 28 of the 33 test rows are classified correctly. The model classifies wine into three types, Barolo, Grignolino and Barbera, which correspond to class labels 1, 2 and 3 respectively.