**Online shopping has grown further since Covid-19. Customers rely heavily on online reviews before settling on a product, and the same holds for services bought online. This kernel walks through some of the essential text preprocessing steps, which are key to any NLP project. Here I work on a set of wine reviews collected online. The dataset also contains reviews for a few other products, such as lip balm and food wine, though it is dominated by reviews of alcohol. The aim is to quickly extract the key information covered by the reviews without having to read all of them manually.**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
#Load necessary libraries

import nltk
from nltk import FreqDist
import spacy
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# View the data set

wines=pd.read_csv("/kaggle/input/wine-reviews/wine reviews.csv")
wines.head()

In [None]:
wines.drop("Sl.No.",axis=1,inplace=True)

In [None]:
# function to plot most frequent terms

def freq_words(x, terms = 30):
    # join all reviews into one string and split it into words
    all_words = ' '.join([text for text in x])
    all_words = all_words.split()

    # count how often each word occurs
    fdist = FreqDist(all_words)
    words_df = pd.DataFrame({'word':list(fdist.keys()), 'count':list(fdist.values())})

    # select the `terms` most frequent words (30 by default) and plot them
    d = words_df.nlargest(columns="count", n = terms)
    plt.figure(figsize=(20,5))
    ax = sns.barplot(data=d, x= "word", y = "count")
    ax.set(ylabel = 'Count')
    plt.show()
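
**The counting step of the function above can be checked without any plotting. The following is a minimal sketch that uses `collections.Counter` from the standard library instead of NLTK's `FreqDist`; the helper name `top_terms` and the sample reviews are made up for illustration.**

```python
from collections import Counter

def top_terms(texts, terms=3):
    """Return the `terms` most frequent whitespace-separated words."""
    counts = Counter(word for text in texts for word in text.split())
    return counts.most_common(terms)

# made-up sample reviews, not taken from the dataset
sample_reviews = ["great wine great price", "great taste", "smooth wine"]
top = top_terms(sample_reviews, terms=2)
print(top)
```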

In [None]:
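# assign the result back; calling fillna(..., inplace=True) on a selected column may not update the DataFrame in recent pandas versions
wines['Reviews Text'] = wines['Reviews Text'].fillna("Good")
wines['Reviews Title'] = wines['Reviews Title'].fillna("Neutral")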

In [None]:
freq_words(wines['Reviews Text'])

In [None]:
freq_words(wines['Reviews Title'])

**The most common words in the review text are "I", "the", "and", "a", etc. The review titles are more specific, with words such as "Great" and "Best", although terms like "the" and "I" appear there too. These terms are not relevant, as they tell us nothing about the review. So it is important to eliminate them, along with numbers, punctuation, and other special characters.**
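
**To see what this kind of cleaning does to a single review, here is a small sketch using Python's `re` module with the pattern `[^a-zA-Z#]`, which keeps only letters and `#`. The helper name `keep_letters` and the sample sentence are made up for illustration.**

```python
import re

def keep_letters(text):
    # replace every character that is not a letter or '#' with a space
    return re.sub(r"[^a-zA-Z#]", " ", text)

# made-up sample review
sample = "Loved it!!! 10/10, would buy again :)"

# collapse the runs of spaces left behind by the replacement
cleaned = " ".join(keep_letters(sample).split())
print(cleaned)
```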

In [None]:
# remove unwanted characters, numbers and symbols
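# regex=True is required in pandas 2.0+, where str.replace treats the pattern as a literal string by default
wines['Reviews Text'] = wines['Reviews Text'].str.replace("[^a-zA-Z#]", " ", regex=True)
wines['Reviews Title'] = wines['Reviews Title'].str.replace("[^a-zA-Z#]", " ", regex=True)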

In [None]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

In [None]:
# function to remove stopwords
def remove_stopwords(rev):
    rev_new = " ".join([i for i in rev if i not in stop_words])
    return rev_new
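
**One detail worth checking is letter case: NLTK's English stop word list is lowercase, so a capitalized "The" or "I" is only filtered out if the text is lowercased first. The sketch below illustrates this with a tiny hand-picked stop word set; `toy_stop_words` and `drop_stop_words` are made-up names, not part of NLTK.**

```python
# tiny hand-picked stop word set for illustration; the kernel uses NLTK's full list
toy_stop_words = {"the", "i", "and", "a", "this"}

def drop_stop_words(tokens, stop_words):
    # keep only the tokens that are not in the stop word set
    return [t for t in tokens if t not in stop_words]

tokens = "The wine and I".split()

# capitalized stop words slip through when the text is not lowercased
without_lowercasing = drop_stop_words(tokens, toy_stop_words)

# lowercasing first lets every stop word match
with_lowercasing = drop_stop_words([t.lower() for t in tokens], toy_stop_words)

print(without_lowercasing)
print(with_lowercasing)
```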

In [None]:
# remove short words (length < 3)
wines['Reviews Text'] = wines['Reviews Text'].apply(lambda x: ' '.join([w for w in x.split() if len(w)>2]))
wines['Reviews Title'] = wines['Reviews Title'].apply(lambda x: ' '.join([w for w in x.split() if len(w)>2]))

# make entire text lowercase first, because the NLTK stop word list is lowercase
reviewstext = [r.lower() for r in wines['Reviews Text']]
reviewstitle = [r.lower() for r in wines['Reviews Title']]

# remove stopwords from the text
reviewstext = [remove_stopwords(r.split()) for r in reviewstext]
reviewstitle = [remove_stopwords(r.split()) for r in reviewstitle]

In [None]:
freq_words(reviewstext, 35)

In [None]:
freq_words(reviewstitle, 35)

**Now the words are more relevant, though some noise remains. The reviews seem dominated by positive comments. The review text is topped by the word "lips", probably because of the lip balm reviews. We can use tokenization and lemmatization to further fine-tune the data set.**
> 
> **Tokenization is the process of breaking a sentence into words. Lemmatization reduces inflected forms of a word (for example, past tense or plural forms) to their base form.**

**These tasks can be achieved using the spaCy library.**
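
**As a rough illustration of the two ideas, the sketch below tokenizes a sentence by splitting on whitespace and lemmatizes it with a tiny hand-written lookup table. This is a toy example only: `toy_lemmas`, `tokenize` and `lemmatize` are made-up names, and a real library relies on its own language pipeline rather than a four-entry dictionary.**

```python
# hypothetical lookup table for illustration only
toy_lemmas = {"wines": "wine", "bottles": "bottle", "tasted": "taste", "better": "good"}

def tokenize(sentence):
    # tokenization: lowercase the sentence and split it into words
    return sentence.lower().split()

def lemmatize(tokens, lemmas):
    # lemmatization: map each token to its base form, keeping unknown tokens unchanged
    return [lemmas.get(t, t) for t in tokens]

tokens = tokenize("Wines tasted better")
lemmas = lemmatize(tokens, toy_lemmas)
print(tokens)
print(lemmas)
```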

In [None]:
# the 'en' shortcut only works in older spaCy releases; spaCy 3+ needs the full model name
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# keep only the lemmas of nouns and adjectives
def lemmatization(texts, tags=['NOUN', 'ADJ']):
    output = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        output.append([token.lemma_ for token in doc if token.pos_ in tags])
    return output

In [None]:
tokenized_reviewstext = pd.Series(reviewstext).apply(lambda x: x.split())
print(tokenized_reviewstext[4])

In [None]:
tokenized_reviewstitle = pd.Series(reviewstitle).apply(lambda x: x.split())
print(tokenized_reviewstitle[6])

In [None]:
reviewstextlem = lemmatization(tokenized_reviewstext)
print(reviewstextlem[5])

In [None]:
reviewstitlelem = lemmatization(tokenized_reviewstitle)
print(reviewstitlelem[10])

In [None]:
reviewslemtext = []
for i in range(len(reviewstextlem)):
    reviewslemtext.append(' '.join(reviewstextlem[i]))

wines['reviewstext'] = reviewslemtext

freq_words(wines['reviewstext'], 35)

In [None]:
reviewslemtitle = []
for i in range(len(reviewstitlelem)):
    reviewslemtitle.append(' '.join(reviewstitlelem[i]))

wines['reviewstitle'] = reviewslemtitle

freq_words(wines['reviewstitle'], 35)

**Now let's generate word clouds for the processed reviews.**

In [None]:
import PIL
from PIL import Image
from wordcloud import WordCloud, ImageColorGenerator

In [None]:
# Use an image for masking
wine_mask = np.array(Image.open("/kaggle/input/wineimage/wineimage.jpg"))


In [None]:
text = " ".join(review for review in wines.reviewstext)

In [None]:

# Create a word cloud image using a mask
wc = WordCloud(background_color="white", max_words=1000, mask=wine_mask)
               

# Generate a wordcloud
wc.generate(text)

# store to file
wc.to_file("/kaggle/working/winereviews.jpg")

# display the image
plt.figure(figsize=[20,10])
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()

In [None]:
texttitle = " ".join(review for review in wines.reviewstitle)

In [None]:
# Generate a word cloud image without masking
wordcloud = WordCloud(background_color="black").generate(texttitle)


plt.figure(figsize=(10,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

**You can explore more NLP techniques, such as topic modeling, using this data set.
For more on working with word clouds, refer to https://www.datacamp.com/community/tutorials/wordcloud-python**