## **This notebook builds a predictive model of wine variety using a neural network**

**Importing the libraries needed to read the CSV files**

In [0]:
import numpy as np
import pandas as pd
import io

**Uploading and reading the train and test CSV files**

In [4]:
from google.colab import files
uploaded = files.upload()

Saving train.csv to train.csv


In [0]:
df = pd.read_csv(io.BytesIO(uploaded['train.csv']))

In [6]:
upload_test = files.upload()

Saving test.csv to test.csv


In [0]:
df_test = pd.read_csv(io.BytesIO(upload_test['test.csv']))
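
In Colab, `files.upload()` returns a dictionary that maps each file name to the raw bytes of the file, which is why the bytes are wrapped in `io.BytesIO` before being passed to `pd.read_csv`. A minimal sketch of that pattern, using made-up bytes in place of the uploaded file:

```python
import io
import pandas as pd

# Made-up bytes standing in for uploaded['train.csv'].
raw = b"variety,points,price\nMalbec,91,20.0\nChardonnay,87,38.0\n"

# BytesIO gives read_csv a file-like object over the in-memory bytes.
toy = pd.read_csv(io.BytesIO(raw))
print(toy.shape)
```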

**Downloading the NLTK stop-word list. The review descriptions are cleaned later by removing the stop words and the variety names that are already present in them, as we do not want to include these when training the model**

In [8]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

**Importing the preprocessing utilities from scikit-learn and NLTK, and the Sequential model and Dense layer from Keras for the neural network**

In [9]:
from sklearn.preprocessing import LabelEncoder
from nltk.corpus import stopwords
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Dense
sw = stopwords.words('english')

Using TensorFlow backend.


**Removing the rows with null price, variety, or points from the dataset**

In [0]:
df = df[pd.notnull(df.price)]
df = df[pd.notnull(df.variety)]
df = df[pd.notnull(df.points)]

df_test = df_test[pd.notnull(df_test.price)]
df_test = df_test[pd.notnull(df_test.points)]
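
The chained `pd.notnull` filters above can also be written as a single `dropna` call. This is only an equivalent sketch on a toy frame; the column names come from the notebook, the values are made up.

```python
import pandas as pd

# Toy frame with the same column names as the notebook; the values are made up.
toy = pd.DataFrame({
    "price": [35.0, None, 20.0, 49.0],
    "variety": ["Pinot Noir", "Malbec", None, "Chardonnay"],
    "points": [88, 90, 91, 87],
})

# Drop every row that has a null in any of the listed columns.
clean = toy.dropna(subset=["price", "variety", "points"])
print(clean.index.tolist())
```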

In [13]:
df_test.head(5)

| Unnamed: 0 | user_name | country | review_title | review_description | designation | points | price | province | region_1 | region_2 | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | @paulgwine | US | Boedecker Cellars 2011 Athena Pinot Noir (Will... | Nicely differentiated from the companion Stewa... | Athena | 88 | 35.0 | Oregon | Willamette Valley | Willamette Valley | Boedecker Cellars |
| 1 | @wineschach | Argentina | Mendoza Vineyards 2012 Gran Reserva by Richard... | Charred, smoky, herbal aromas of blackberry tr... | Gran Reserva by Richard Bonvin | 90 | 60.0 | Mendoza Province | Mendoza | | Mendoza Vineyards |
| 2 | @vboone | US | Prime 2013 Chardonnay (Coombsville) | Slightly sour and funky in earth, this is a re... | | 87 | 38.0 | California | Coombsville | Napa | Prime |
| 3 | @wineschach | Argentina | Bodega Cuarto Dominio 2012 Chento Vineyard Sel... | This concentrated, midnight-black Malbec deliv... | Chento Vineyard Selection | 91 | 20.0 | Mendoza Province | Mendoza | | Bodega Cuarto Dominio |
| 4 | @kerinokeefe | Italy | SassodiSole 2012 Brunello di Montalcino | Earthy aromas suggesting grilled porcini, leat... | | 90 | 49.0 | Tuscany | Brunello di Montalcino | | SassodiSole |


**For this notebook, I've only used the description column for training the model**

In [0]:
input_data = df['review_description']
output_data = df['variety']

**Label Encoding the output column, i.e. the Variety column**

In [15]:
labelEncoder = LabelEncoder()
output_data = labelEncoder.fit_transform(output_data)
output_data

array([ 5, 17, 11, ...,  6,  0,  3])
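
The integer codes above are positions in the encoder's sorted list of class names, and they can be mapped back to variety names with `inverse_transform`. A small sketch with made-up labels (the notebook fits the encoder on `df['variety']`):

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels; the notebook fits the encoder on df['variety'].
varieties = ["Malbec", "Chardonnay", "Gamay", "Chardonnay"]

encoder = LabelEncoder()
codes = encoder.fit_transform(varieties)

# classes_ holds the sorted unique labels; a label's code is its index there.
print(list(encoder.classes_))
print(codes.tolist())

# inverse_transform maps integer codes back to the original strings.
decoded = list(encoder.inverse_transform(codes))
print(decoded)
```

The same call is what turns the network's predicted class indices back into readable variety names.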

**Because the variety names are already present in the descriptions, we first remove them from the descriptions and clean the data to get a better ML model**

In [16]:
wine = df.variety.unique().tolist()
wine.sort()
wine[:10]

['Bordeaux-style Red Blend',
 'Bordeaux-style White Blend',
 'Cabernet Franc',
 'Cabernet Sauvignon',
 'Champagne Blend',
 'Chardonnay',
 'Gamay',
 'Gewürztraminer',
 'Grüner Veltliner',
 'Malbec']

**Building a sorted list of the individual words in the variety names, which are to be removed from the descriptions**

In [17]:
output = set()
for x in df.variety:
    x = x.lower()
    x = x.split()
    for y in x:
        output.add(y)

variety_list = sorted(output)
variety_list[:10]

['blanc',
 'blend',
 'bordeaux-style',
 'cabernet',
 'champagne',
 'chardonnay',
 'franc',
 'gamay',
 'gewürztraminer',
 'grigio']

**Updating the stop-word list by appending the variety name list to it.**

In [0]:
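# Punctuation marks and the abbreviation 'cab' are filtered along with the variety words.
extras = ['.', ',', '"', "'", '?', '!', ':', ';', '(', ')', '[', ']', '{', '}', 'cab', '%']
# stopwords was already imported from nltk.corpus in an earlier cell.
stop = set(stopwords.words('english'))
stop.update(variety_list)
stop.update(extras)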

**Creating a sparse representation of the token counts using the CountVectorizer function**

In [19]:
# stop_words is documented to take a list, so the set is converted first.
countVectorizer = CountVectorizer(stop_words=list(stop))
input_data = countVectorizer.fit_transform(df.review_description)

  'stop_words.' % sorted(inconsistent))
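
The truncated output above appears to be the tail of scikit-learn's warning that the stop-word list may be inconsistent with the vectorizer's preprocessing. The likely cause is that `CountVectorizer`'s default tokenizer splits on punctuation, so a stop word such as `bordeaux-style` never matches the tokens `bordeaux` and `style` that actually appear in the text. One possible way to avoid this, sketched here on made-up data and not taken from the original notebook, is to build the variety stop words with the vectorizer's own analyzer:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up stand-ins for df.variety and df.review_description.
varieties = ["Bordeaux-style Red Blend", "Pinot Noir"]
reviews = [
    "A Bordeaux-style blend with red fruit and firm tannins.",
    "Silky Pinot Noir showing cherry and spice.",
]

# The default analyzer lowercases and keeps runs of 2+ word characters,
# so 'Bordeaux-style' becomes the two tokens 'bordeaux' and 'style'.
analyze = CountVectorizer().build_analyzer()
variety_tokens = sorted({tok for name in varieties for tok in analyze(name)})
print(variety_tokens)

# Stop words built this way match the tokens the vectorizer sees in the reviews.
vectorizer = CountVectorizer(stop_words=variety_tokens)
counts = vectorizer.fit_transform(reviews)
print(sorted(vectorizer.vocabulary_))
```

With this approach the single-character punctuation entries are unnecessary, since the default tokenizer never emits them as tokens.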


**Splitting the data into train and test**

In [0]:
X_train, X_test, y_train, y_test = train_test_split(input_data, output_data, test_size=0.2, random_state = 1) 
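
The split above is random but not stratified, so rare varieties may be unevenly represented in the two parts. As an optional variation (not what the notebook does), `train_test_split` accepts a `stratify` argument that preserves class proportions. A sketch on made-up data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples, 2 features, two balanced classes.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# stratify=y keeps the class proportions the same in both splits;
# the notebook's call omits it, so this is a variation, not the original.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)
print(X_tr.shape, X_te.shape)
```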

**Adding layers to our Neural Network, then compiling and training it**

In [21]:
model = Sequential()
# The input size is the vocabulary size, i.e. the number of columns in the count matrix.
model.add(Dense(100, activation='relu', input_dim=input_data.shape[1]))
# Only the first layer needs input_dim; later layers infer their input size.
model.add(Dense(50, activation='relu'))
# One output unit per encoded variety.
model.add(Dense(units=output_data.max()+1, activation='sigmoid'))
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=5, verbose=1)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.callbacks.History at 0x7fda5ea4fbe0>
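
The output layer above uses a sigmoid activation together with `sparse_categorical_crossentropy`. For single-label, multi-class problems the more conventional pairing is a softmax output, because it produces one probability distribution over the classes, whereas sigmoid scores each class independently. The difference can be seen in a small NumPy sketch with made-up scores (this illustrates the two functions only and is not the notebook's model):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

logits = np.array([2.0, 1.0, 0.1])  # made-up scores for three classes

soft = softmax(logits)
sig = sigmoid(logits)
print(soft.sum())  # softmax outputs form a probability distribution
print(sig.sum())   # sigmoid outputs are independent and need not sum to 1
```

Swapping `activation='sigmoid'` for `activation='softmax'` in the last `Dense` layer would be the corresponding change in Keras; the accuracy reported in this notebook was obtained with sigmoid.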

**Checking the accuracy achieved on the held-out test split using the Neural Network**

In [22]:
scores = model.evaluate(X_test, y_test, batch_size = 500, verbose=1)
print ('The accuracy of the model is %s' % scores[1])

The accuracy of the model is 0.6135036945343018
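
The accuracy figure is the fraction of held-out samples whose highest-scoring class matches the true label. A sketch of that calculation, with made-up probabilities standing in for the output of `model.predict(X_test)`:

```python
import numpy as np

# Made-up class probabilities for four samples and three classes,
# standing in for model.predict(X_test).
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.3, 0.3, 0.4],
    [0.5, 0.4, 0.1],
])
y_true = np.array([0, 1, 1, 0])

# The predicted class is the index of the largest probability in each row.
y_pred = probs.argmax(axis=1)
accuracy = (y_pred == y_true).mean()
print(accuracy)
```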


# As seen above, we achieved an accuracy of about 61.4% on the held-out test split, which is slightly less than the accuracy achieved with logistic regression.

# Hence, I've chosen Logistic Regression as the final model for making the predictions.