
# Wine Recommendation

In this implementation, when the user searches for a wine, we recommend the 10 most similar wines. The recommender is content-based: it compares wine descriptions using TF-IDF and cosine similarity.

The data is available [here](https://www.kaggle.com/zynicide/wine-reviews).

In [1]:
# Unzip the data if needed

# !unzip winemag-data_first150k.csv.zip
# !ls

In [2]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
import re
import random
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [27]:
df = pd.read_csv("winemag-data_first150k.csv")

In [28]:
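# Keep only the columns used by the recommender and drop rows with missing values
dataset = df[["country", "designation", "description"]].dropna()
# keep=False removes every row whose designation appears more than once,
# so each remaining designation is unique
dataset = dataset.drop_duplicates(subset="designation", keep=False)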

In [30]:
dataset

index,country,designation,description
1,Spain,Carodorum Selección Especial Reserva,"Ripe aromas of fig, blackberry and cassis are ..."
2,US,Special Selected Late Harvest,Mac Watson honors the memory of a wine once ma...
4,France,La Brûlade,"This is the top wine from La Bégude, named aft..."
7,Spain,Carodorum Único Crianza,Lush cedary black-fruit aromas are luxe and of...
8,US,Silice,This re-named vineyard was formerly bottled as...
...,...,...,...
149452,Portugal,Plus 20-year old tawny,"This is a huge, concentrated, dry wine, which ..."
149532,France,Grand Cru Muenchberg,"Minty, herbal notes mark the nose, layered ove..."
149538,US,Delaware Dolce,"There is plenty of sweet fruit, honey, golden ..."
149624,Portugal,Presidential 20-year old tawny,"An easy, fresh, ripe style, with dried fruits ..."


The description of each wine shows what information the data frame holds for it. The helper function below prints the description, designation, and country of the wine at a given index.

In [31]:
def print_description(index):
    # Select the rows whose index label matches; the result may be empty
    rows = dataset[dataset.index == index][['description', 'designation', 'country']].values
    if len(rows) > 0:
        example = rows[0]
        print(example[0])
        print('designation:', example[1])
        print('country:', example[2])

In [32]:
print_description(12)

A standout even in this terrific lineup of 2015 releases from Patricia Green, the Weber opens with a burst of cola and tobacco scents and accents. It continues, subtle and detailed, with flavors of oranges, vanilla, tea and milk chocolate discreetly threaded through ripe blackberry fruit.
designation: Weber Vineyard
country: US


## Text Preprocessing

After inspecting the wine descriptions, the next step is text preprocessing, so that the text can later be converted into numbers with TF-IDF and compared with cosine similarity. Only the 'description' column is used, because it is the text on which the similarity between wines is computed.

In [33]:
# Raw strings keep the backslashes in the patterns from being read as string escapes
clean_spcl = re.compile(r'[/(){}\[\]\|@,;]')
clean_symbol = re.compile(r'[^0-9a-z #+_]')
stopworda = set(stopwords.words('english'))

def clean_text(text):
    text = text.lower() # lowercase text
    text = clean_spcl.sub(' ', text)
    text = clean_symbol.sub('', text)
    text = ' '.join(word for word in text.split() if word not in stopworda) # remove stopwords from the description
    return text

In [34]:
dataset['desc_clean'] = dataset['description'].apply(clean_text)
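
To see what the cleaning step does to a single review, here is a minimal, self-contained sketch. It reuses the two regular expressions from `clean_text`, but swaps NLTK's stop word list for a tiny hard-coded set (`SMALL_STOPWORDS`, a made-up stand-in) so that it runs without downloading anything. The sample sentence and the name `clean_text_demo` are also invented for illustration.

```python
import re

# Same patterns as in clean_text, written as raw strings
clean_spcl = re.compile(r'[/(){}\[\]\|@,;]')
clean_symbol = re.compile(r'[^0-9a-z #+_]')

# Hypothetical stand-in for stopwords.words('english'); the real list is much longer
SMALL_STOPWORDS = {'a', 'an', 'and', 'are', 'in', 'is', 'of', 'the', 'this', 'with'}

def clean_text_demo(text, stop_words=SMALL_STOPWORDS):
    text = text.lower()                # lowercase the text
    text = clean_spcl.sub(' ', text)   # separators such as , ( ) ; become spaces
    text = clean_symbol.sub('', text)  # every other symbol is dropped outright
    return ' '.join(w for w in text.split() if w not in stop_words)

# Made-up review text, not taken from the dataset
sample = "A dry, full-bodied red with notes of cherry (and oak); great value!"
cleaned = clean_text_demo(sample)
print(cleaned)
```

Note that the hyphen is removed rather than replaced by a space, so "full-bodied" becomes the single token "fullbodied".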

## TF-IDF and Cosine Similarity

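After preprocessing, use `TfidfVectorizer` to convert the cleaned descriptions into a numeric matrix, then compute the cosine similarity between every pair of wines. The result is a square matrix in which entry (i, j) is the similarity between wine i and wine j.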

In [35]:
dataset.set_index('designation', inplace=True)
# min_df=1 keeps every term that appears in at least one description
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(dataset['desc_clean'])
cos_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
cos_sim

array([[1.00000000e+00, 3.13757120e-03, 1.15394988e-03, ...,
        6.71982903e-04, 3.92640816e-03, 2.06468345e-03],
       [3.13757120e-03, 1.00000000e+00, 1.44474040e-03, ...,
        1.29769330e-03, 0.00000000e+00, 1.19557313e-03],
       [1.15394988e-03, 1.44474040e-03, 1.00000000e+00, ...,
        2.95202813e-03, 3.24299928e-03, 1.65024707e-03],
       ...,
       [6.71982903e-04, 1.29769330e-03, 2.95202813e-03, ...,
        1.00000000e+00, 0.00000000e+00, 1.48228330e-03],
       [3.92640816e-03, 0.00000000e+00, 3.24299928e-03, ...,
        0.00000000e+00, 1.00000000e+00, 1.69820557e-03],
       [2.06468345e-03, 1.19557313e-03, 1.65024707e-03, ...,
        1.48228330e-03, 1.69820557e-03, 1.00000000e+00]])
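
To make the idea behind this matrix concrete, the sketch below computes TF-IDF weights and cosine similarity by hand on a three-document toy corpus. It is a simplified illustration, not a reimplementation of scikit-learn: it uses plain `count * log(N / df)` weights and single words only, whereas `TfidfVectorizer` by default smooths the IDF term and L2-normalizes each row, and the notebook also uses n-grams up to length 3. The corpus and all names here are made up.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration (already cleaned, split on spaces)
docs = [
    "ripe blackberry cassis oak",
    "ripe blackberry vanilla oak",
    "crisp lemon green apple",
]

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_vector(tokens):
    # Simplified weighting: raw term count * log(N / document frequency)
    counts = Counter(tokens)
    return {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}

def cosine(u, v):
    # Dot product over shared terms, divided by the product of vector lengths
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

vectors = [tfidf_vector(t) for t in tokenized]
sim = [[cosine(u, v) for v in vectors] for u in vectors]

for row in sim:
    print([round(x, 3) for x in row])
```

The first two documents share three terms and get a clearly positive score, the third shares none and scores 0, and every document has similarity 1 with itself, which is the same pattern as the diagonal of `cos_sim` above.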

To look up recommendations, create the `indices` variable, a Series that maps each row position in the similarity matrix to a wine designation. Then inspect its first entries.

In [36]:
# Build the main lookup Series from the 'designation' index
indices = pd.Series(dataset.index)
indices[:50]

0                  Carodorum Selección Especial Reserva
1                         Special Selected Late Harvest
2                                            La Brûlade
3                               Carodorum Único Crianza
4                                                Silice
5                                    Ronco della Chiesa
6                       Estate Vineyard Wadensvil Block
7                                        Weber Vineyard
8                               Château Montus Prestige
9                                6 Años Reserva Premium
10                                           Grignolino
11                                        Giallo Solare
12                                        R-Bar-R Ranch
13                                              Abetina
14                                             Babushka
15                           Nonpareil Trésor Rosé Brut
16                      Vigneto Odoardo Beccari Riserva
17                                         Les 7
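
The lookup idea can also be sketched without pandas. The example below maps each designation to its row position with a plain dictionary, using the first five designations from the output above. The helper `find_position` is hypothetical; it returns `None` for an unknown name, whereas a lookup of the form `indices[indices == name].index[0]` raises an `IndexError` in that case.

```python
# First five designations from the output above, in row order
designations = [
    "Carodorum Selección Especial Reserva",
    "Special Selected Late Harvest",
    "La Brûlade",
    "Carodorum Único Crianza",
    "Silice",
]

# Map each designation to its row position in the similarity matrix
position_of = {name: pos for pos, name in enumerate(designations)}

def find_position(name):
    # Hypothetical helper: row position of the wine, or None if the name is unknown
    return position_of.get(name)

print(find_position("La Brûlade"))
print(find_position("Not A Real Wine"))
```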

## Modelling

At the modeling stage, I created a function that recommends similar wines based on the TF-IDF and cosine similarity results. It returns the 10 wines whose descriptions are closest to that of the designation we pass in.

In [69]:
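def recommendations(name, cos_sim = cos_sim):
    # Row position of the requested designation in the similarity matrix
    idx = indices[indices == name].index[0]
    # Rank all wines by similarity to the requested one, highest first
    score_series = pd.Series(cos_sim[idx]).sort_values(ascending = False)
    # Skip the first entry (normally the wine itself) and keep the next 10
    top_10_indexes = list(score_series.iloc[1:11].index)

    recommended_wine = []
    for i in top_10_indexes:
        designation = indices[i]
        recommended_wine.append({'Designation': designation, 'Country': dataset['country'][designation]})

    return pd.DataFrame(recommended_wine)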

In [74]:
recommendations("Lot 525")

index,Designation,Country
0,Sophia's Hillside Cuvée,US
1,Merlo Vineyards,US
2,Hof Zu Pramol,Italy
3,La Rose Saint-Vincent,France
4,Velo Red,US
5,88 Revel Family Vineyard,US
6,Periquita Original,Portugal
7,HMV,US
8,SP 68,Italy
9,Cavalli,Italy
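
The function skips the query wine by slicing with `iloc[1:11]`, which assumes the query always sorts first. If another wine had an identical description, that might not hold. One possible alternative, sketched here with a made-up similarity row and a hypothetical helper `top_k_similar`, is to exclude the query position explicitly before ranking:

```python
def top_k_similar(sim_row, query_pos, k=10):
    # Rank every position except the query itself by similarity, highest first
    candidates = [(score, pos) for pos, score in enumerate(sim_row) if pos != query_pos]
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [pos for _, pos in candidates[:k]]

# Made-up similarity row for the wine at position 0;
# position 3 stands for a wine with an identical description
row = [1.0, 0.2, 0.05, 1.0, 0.4]
top = top_k_similar(row, query_pos=0, k=3)
print(top)
```

The returned positions can then be mapped back to designations and countries in the same way as in `recommendations`.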
