# Week 6 - Analyzing Food Reviews

### Introduction to Natural Language Processing (NLP)

- Automatic computational processing of human languages

- Functions:
    - Take and understand text data
    - Generate natural looking text

- Language is unstructured and raw
<br><br>

<img src="./images/news.png" width="700">

##### **Why is NLP hard?**
- Language is highly variable and ambiguous
- Language is symbolic, discrete and sparse.
- Humans cannot fully specify the rules that govern language

<img src="./images/pizza.png" width="800">


#### Online Video Recap:

- NLP and Applications
- NLP Corpora and Packages
- Tokenization
- Removing Stopwords
- Stemming and Lemmatization
- Synonyms, Antonyms, Hypernyms, Hyponyms
- Exploratory Data Analysis for Text:
    - Word Cloud
    - Word Vectors
    - Dimensionality Reduction and Visualization

#### <span style='color:blue'>Problem: Analyze Food Reviews</span> 

Tasks for this exercise:
1. Clean Food Reviews Data
2. Analyze most frequent words
3. Analyze words by rating
4. Use synonyms for search and analysis
5. Create new features from raw text
6. Create simple Bag of Words Features
7. Create Word Vectors
8. Visualize Results

#### <span style='color:blue'>Task 1: Fetch Food Reviews Data</span> 

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
# Get data
import pandas as pd
# read only the five columns needed for this exercise
food_reviews = pd.read_csv("reviews.csv", usecols=["Restaurant", "Reviewer", "Review", "Rating", "Time"])

In [None]:
# Basic info and summary
print(food_reviews.shape)
food_reviews.head(2)

---

### Tokenization

Tokenization is a fundamental step in natural language processing (NLP) that involves breaking down a stream of text into smaller units called tokens, which can be words, subwords, or characters.

- Tokenization assists in text normalization and helps convert text into a structured format that is easier to analyze.
- It simplifies complex text, since tokenized text is more manageable and less ambiguous.
- Tokenization allows for the extraction of features from text data, such as word frequency, n-grams, and term-document matrices.
- Tokenized text is easier to analyze statistically, allowing for the calculation of metrics such as word frequency and co-occurrence, and supporting tasks such as sentiment analysis.

#### <span style='color:blue'>Task 2: Tokenize the Reviews</span> 

In [None]:
# Tokenize reviews
tokenized_reviews = food_reviews.loc[:, "Review"].str.split()
tokenized_reviews.head()

Punkt is a data-driven, unsupervised system for detecting sentence boundaries. It is robust across a variety of languages and text formats. NLTK ships **pre-trained Punkt models**, and `word_tokenize` uses them to split text into sentences before breaking each sentence into word tokens.

In [None]:
import nltk
from nltk.tokenize import word_tokenize
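# download the Punkt tokenizer models from NLTK library
nltk.download('punkt')
# Newer NLTK releases look for the 'punkt_tab' resource instead of 'punkt';
# downloading both keeps word_tokenize working across versions
nltk.download('punkt_tab')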

tokenized_reviews = food_reviews.loc[:, "Review"].astype(str).apply(word_tokenize).copy()
tokenized_reviews.head()

Both the string `split()` method and NLTK's `word_tokenize` function tokenize text, but they differ in sophistication and accuracy. The key differences are:

##### String `split()` Function

- The `split()` function splits a string based on a specified delimiter (default is any whitespace).
- It does not take into account linguistic rules or the context of the text.
- Does not handle punctuation correctly (e.g., "world!" remains "world!").
- Does not recognize contractions or other complex linguistic structures.



In [None]:
text = "Hello, world! How's it going?"
tokens = text.split()
print(tokens)


##### NLTK `word_tokenize` Function

- The `word_tokenize` function uses a more sophisticated approach to tokenization, leveraging linguistic rules and models.
- It handles punctuation, contractions, and other complexities in the text more accurately (e.g., splits "world!" into "world" and "!").
- Recognizes contractions and splits them appropriately (e.g., "How's" into "How" and "'s").


In [None]:
from nltk.tokenize import word_tokenize
text = "Hello, world! How's it going?"
tokens = word_tokenize(text)
print(tokens)
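
To see roughly what separating punctuation from words involves, here is a minimal regex-based sketch that uses only the standard library. It is not how `word_tokenize` is implemented: the pattern and the `simple_tokenize` helper are illustrative assumptions, and unlike NLTK this sketch keeps a contraction such as "How's" as a single token.

```python
import re

# Illustrative pattern (an assumption, not NLTK's rules): a run of word
# characters, optionally followed by an apostrophe part such as "'s",
# or any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def simple_tokenize(text):
    """Split text into word and punctuation tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(text)

text = "Hello, world! How's it going?"
print(text.split())           # punctuation stays attached to the words
print(simple_tokenize(text))  # punctuation becomes separate tokens
```

A single regular expression cannot cover abbreviations, URLs, or language-specific rules, which is why a dedicated tokenizer is usually the safer choice for real review text.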

---

### Stop Words

Stop words are common words that typically do not carry significant meaning and are often filtered out to focus on the more informative parts of the text.

How does removing stop words help?
- Removing stop words decreases the number of unique words in the text, reducing the dimensionality of the text data.
- Stop words often add noise to the text data without contributing meaningful information. Removing them improves the signal-to-noise ratio.
- Eliminates words that rarely contribute to the predictive power of a model. One caveat: negations such as "not" are on NLTK's English stop word list, and they can matter for sentiment analysis.
- The focus shifts to more meaningful and content-rich words.
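
The effect on vocabulary size can be sketched on a toy corpus. The two reviews and the tiny stop list below are made up for illustration; NLTK's English list is much longer.

```python
# Tiny hand-picked stop list for illustration only (an assumption, not NLTK's list)
toy_stop_words = {"the", "was", "and", "a", "it", "is"}

# Two made-up, already tokenized and lower-cased reviews
reviews = [
    ["the", "biryani", "was", "tasty", "and", "the", "service", "was", "quick"],
    ["it", "is", "a", "cozy", "place", "and", "the", "staff", "is", "friendly"],
]

# Drop every token that appears in the stop list
filtered = [[w for w in tokens if w not in toy_stop_words] for tokens in reviews]

# Compare the number of unique words before and after filtering
vocab_before = {w for tokens in reviews for w in tokens}
vocab_after = {w for tokens in filtered for w in tokens}

print(len(vocab_before), "unique words before,", len(vocab_after), "after")
print(filtered)
```

Every remaining token is a content word, so any downstream count or model works with fewer, more informative dimensions.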

In [None]:
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

#### <span style='color:blue'>Task 3: Remove stopwords and punctuation</span> 

In [None]:
# Add your custom stop words here
custom_stop_words = {'food', 'dinner'}  
all_stop_words = stop_words.union(custom_stop_words)


In [None]:
# lower case strings
tokenized_reviews = tokenized_reviews.apply(lambda x: [word.lower() for word in x])

In [None]:
# Function to remove stop words
def remove_stop_words(tokens):
    return [word for word in tokens if word.lower() not in all_stop_words]

food_reviews_no_stop = tokenized_reviews.apply(remove_stop_words)

In [None]:
# food_reviews.head()
food_reviews_no_stop.head()

Punctuation marks carry little semantic meaning for word-level analysis, and removing them simplifies the text, making it easier to process and analyze.

In [None]:
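# Remove punctuation
import string
def remove_punctuation(tokens):
    # Keep a token only if it contains at least one non-punctuation character,
    # so multi-character tokens such as "..." or "``" are dropped as well
    return [word for word in tokens if any(ch not in string.punctuation for ch in word)]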

# Apply the function to the tokenized series
food_reviews_no_punct = food_reviews_no_stop.apply(remove_punctuation)

In [None]:
food_reviews_no_punct.head()

---

### Stemming and Lemmatization

**Stemming** is the process of reducing a word to its base or root form. The resulting stem may not be a valid word, and stemming algorithms typically use heuristic rules to strip suffixes from words.

**Lemmatization** is the process of reducing a word to its base or dictionary form, known as the lemma. Lemmatization considers the context and the part of speech of the word to determine its lemma. It uses linguistic rules and dictionaries.

Lemmatization is generally more accurate and is preferred for tasks that need valid word forms or deeper linguistic analysis; stemming is simpler and faster.
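
The contrast between the two can be sketched with toy stand-ins. The suffix list and the small lookup dictionary below are assumptions made for illustration; they are far simpler than a real stemmer (for example Porter's) or a WordNet-based lemmatizer.

```python
# Toy suffix rules, checked in order; real stemmers use many more rules
SUFFIXES = ("ing", "ies", "ed", "es", "s")

# Toy dictionary standing in for a lexical database such as WordNet
LEMMAS = {"children": "child", "studies": "study", "eating": "eat", "tasted": "taste"}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemma(word):
    """Look the word up in the dictionary; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)

for word in ["children", "studies", "eating", "tasted"]:
    print(f"{word:10} stem={toy_stem(word):10} lemma={toy_lemma(word)}")
```

The stemmer produces non-words such as "stud" and "tast" and leaves the irregular plural "children" untouched, while the dictionary lookup returns valid base forms.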

In [None]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

#### <span style='color:blue'>Task 4: Lemmatize the tokens</span> 

In [None]:
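# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# lemmatize() treats every word as a noun unless a part of speech is passed,
# so 'are' and 'eating' come back unchanged here (pos='v' would give 'be' and 'eat')
for word in ['the', 'children', 'are', 'eating']:
    print(lemmatizer.lemmatize(word))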

In [None]:
# Define a function to lemmatize tokens
def lemmatize_tokens(tokens):
    return [lemmatizer.lemmatize(token) for token in tokens]

# Apply the function to the tokenized series
food_reviews_lemmatized = food_reviews_no_punct.apply(lemmatize_tokens)
food_reviews_lemmatized.head()


### Exploratory Data Analysis

### Wordcloud

In [None]:
# create a word cloud
from wordcloud import WordCloud
import matplotlib.pyplot as plt
wordcloud = WordCloud(background_color="white", max_words=1000, contour_width=3, contour_color='steelblue')
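# Join every token into one string; str() on a Series only gives its truncated printout
wordcloud.generate(' '.join(word for tokens in food_reviews_lemmatized for word in tokens))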
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

#### <span style='color:blue'>Task 5: Perform basic EDA on Reviews data</span> 

In [None]:
# word frequency count of top 10 words
from collections import Counter
words = []
for i in food_reviews_lemmatized:
    words.extend(i)
word_count = Counter(words)
word_count.most_common(10)

In [None]:
# plot top 10 words
plt.figure(figsize=(10,5))
plt.bar(*zip(*word_count.most_common(10)))
plt.show()

In [None]:
# Show the distribution of review length
food_reviews.loc[:, 'Review'].str.len().hist(bins=100, range=(0, 1500))

#### <span style='color:blue'>Task 6: WordCloud for the best and worst reviews</span> 

In [None]:
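# convert ratings to numbers; entries that cannot be parsed become NaN
food_reviews["Rating"] = pd.to_numeric(food_reviews["Rating"], errors='coerce')
# Rows are kept so the index stays aligned with food_reviews_lemmatized;
# NaN ratings fail both comparisons below and are simply left out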

In [None]:
high_review_index = food_reviews.loc[food_reviews["Rating"] > 4].index
low_review_index = food_reviews.loc[food_reviews["Rating"] < 2].index

In [None]:
wordcloud = WordCloud(background_color="white", max_words=1000, contour_width=3, contour_color='steelblue')
wordcloud.generate(' '.join(word for tokens in food_reviews_lemmatized.loc[high_review_index] for word in tokens))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

In [None]:
wordcloud = WordCloud(background_color="white", max_words=1000, contour_width=3, contour_color='steelblue')
wordcloud.generate(' '.join(word for tokens in food_reviews_lemmatized.loc[low_review_index] for word in tokens))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

### Synonym Search

In [None]:
from nltk.corpus import wordnet

# Example word
word = "delicious"

# Get synonyms
synonyms = wordnet.synsets(word)
synonym_list = set()
for syn in synonyms:
    for lemma in syn.lemmas():
        synonym_list.add(lemma.name())

print(synonym_list)


In [None]:
def search_reviews(query, reviews):
    synonyms = wordnet.synsets(query)
    synonym_list = set()
    for syn in synonyms:
        for lemma in syn.lemmas():
            synonym_list.add(lemma.name())

    results = []
    result_idx = []
    # items() yields the index label of each review, which can be used with .loc
    for idx, review in reviews.items():
        if any(synonym in review for synonym in synonym_list):
            results.append(review)
            result_idx.append(idx)
    return results, result_idx

query = "delicious"
results, result_idx = search_reviews(query, food_reviews_lemmatized)


In [None]:
food_reviews.loc[result_idx[1], 'Review']

### Feature Extraction and Structuring in Text Analysis

Link to the case study: [Airbnb Example](https://medium.com/airbnb-engineering/prioritizing-home-attributes-based-on-guest-interest-3c49b827e51a)

<img src="./images/airbnb.png" width="600">

#### <span style='color:blue'>Task 7: Extract Boolean attributes from raw reviews</span> 
The task is to pick attributes about the restaurant in 5 main categories and check whether the reviews mention any of the keywords within those categories.

In [None]:
food_reviews.loc[:, 'Review_Processed'] = food_reviews_lemmatized

In [None]:
feature_keywords = {
    'good_environment': ['ambience', 'environment', 'atmosphere', 'decor'],
    'good_service': ['service', 'staff', 'waiter', 'waitress', 'server'],
    'good_food': ['delicious', 'tasty', 'yummy', 'flavorful'],
    'expensive': ['expensive', 'pricey', 'costly'],
    'good_value': ['cheap', 'inexpensive', 'affordable', 'value']
}


In [None]:
def extract_features(tokens, feature_keywords):
    features = {}
    for feature, keywords in feature_keywords.items():
        features[feature] = any(token.lower() in keywords for token in tokens)
    return features

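# Apply feature extraction to each review (one dict of booleans per review)
feature_dicts = food_reviews.loc[:, 'Review_Processed'].apply(lambda tokens: extract_features(tokens, feature_keywords))
# Reuse the reviews' index so a later column-wise concat aligns row by row
features_df = pd.DataFrame(feature_dicts.tolist(), index=food_reviews.index)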


In [None]:
features_df.head()

In [None]:
food_reviews = pd.concat([food_reviews, features_df], axis=1)

In [None]:
# Select columns by name rather than by position
food_reviews[['Review', *feature_keywords]].head(10)

In [None]:
# Exercise: Generate feature keywords dictionary using synonyms

---

### Text Representation

Representing text in a numerical format is a fundamental step in natural language processing (NLP). Here are some of the most common methods for representing text, including both traditional and modern techniques:

#### 1. Bag of Words (BoW)

- Represents text by the presence or absence (or frequency) of words.
- Constructs a vocabulary from all the unique words in the corpus.

**Uses:**<br>
- Sentiment Analysis: Determine the sentiment (positive or negative) of a document by counting the occurrences of positive and negative words.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample text
corpus = ["This is the first document.", "This document is the second document.", "And this is the third one."]

# Create the BoW model
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Print the vocabulary and the document-term matrix
print(vectorizer.get_feature_names_out())
print(X.toarray())

#### <span style='color:blue'>Task 8: Create a Bag of Words from Food Reviews</span> 

In [None]:
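# Create labels based on ratings: 1 = positive (rating above 3), 0 = otherwise
# Note: a missing (NaN) rating also fails the x > 3 test and is labelled 0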
food_reviews.loc[:, 'label'] = food_reviews.loc[:, 'Rating'].apply(lambda x: 1 if x > 3 else 0)
y = food_reviews.loc[:, 'label']


In [None]:
joined_reviews = [' '.join(review) for review in food_reviews_lemmatized]

from sklearn.feature_extraction.text import CountVectorizer

# Initialize CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the text data
X = vectorizer.fit_transform(joined_reviews)


#### <span style='color:blue'>Task 9: Train and test sentiment analysis model with BoW</span> 

In [None]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [None]:
from sklearn.naive_bayes import MultinomialNB

# Initialize and train the classifier
clf = MultinomialNB()
clf.fit(X_train, y_train)


In [None]:
from sklearn.metrics import accuracy_score, classification_report

# Predict the labels for the test set
y_pred = clf.predict(X_test)

# Evaluate the classifier
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
print('Classification Report:')
print(classification_report(y_test, y_pred))


In [None]:
# predict a single review
review = food_reviews_lemmatized[10]
review_vector = vectorizer.transform([' '.join(review)])
print(food_reviews.loc[10, 'Review'])
print(clf.predict(review_vector))


In [None]:
food_reviews['label'].value_counts()


#### 2. Term Frequency-Inverse Document Frequency (TF-IDF)

**Description:**
- Adjusts the frequency of words by how common or rare they are across all documents.
- TF measures how frequently a word appears in a document.
- IDF measures how rare a word is across the corpus, so words that appear in many documents receive a lower weight.


**Uses:**
- Document Similarity: Measure similarity between documents, useful in clustering and recommendation systems.
- Keyword Extraction: Identify important words in a document by giving less importance to common words that appear in many documents.
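
The weighting can be sketched by hand on a toy corpus. This uses one common textbook definition, tf = count / document length and idf = log(N / df); the documents are made up, and libraries typically use smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter

# Three made-up, already tokenized reviews
docs = [
    ["tasty", "biryani", "great", "service"],
    ["slow", "service", "cold", "biryani"],
    ["great", "ambience", "great", "music"],
]

def tf(term, doc):
    """Term frequency: share of the document's tokens that are `term`."""
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    """Inverse document frequency: log(N / number of documents containing term).

    Assumes the term occurs in at least one document.
    """
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    """Weight of `term` in `doc`, relative to the whole corpus."""
    return tf(term, doc) * idf(term, docs)

for term in ["great", "ambience"]:
    print(term, round(tf_idf(term, docs[2], docs), 3))
```

In the third review "great" occurs twice and "ambience" once, yet "ambience" gets the higher weight because it appears in no other document.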




#### 3. Word Embeddings (Word Vectors)

**Description:**
- Dense vector representations of words.
- Capture semantic meaning by placing similar words close to each other in the vector space.
- Common models include Word2Vec, GloVe, and FastText.


<img src="./images/embeddings.png" width="800">
<br>Image Source: nlplanet.org/course-practical-nlp


**Uses:**
- Semantic Similarity: Determine how similar two words or phrases are in meaning.
- Machine Translation: Translate words and phrases by finding equivalent representations in different languages.
- Sentiment Analysis: Capture context and nuances in text by understanding word meanings and relationships.


#### <span style='color:blue'>Task 10: Create Embeddings</span> 

**Word2Vec**, developed by the Google research team led by Tomas Mikolov, is a popular technique for learning word embeddings from text. It's an unsupervised learning model that uses a neural network to produce dense, continuous vector representations of words, capturing their semantic meanings and relationships.



In [None]:
from gensim.models import Word2Vec

# Sample sentences
sentences = [["this", "is", "the", "first", "document"],
             ["this", "is", "the", "second", "document"],
             ["and", "this", "is", "the", "third", "one"]]

# Train the Word2Vec model
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, workers=4)

# Print the vector for the word 'document'
print(model.wv['document'])

In [None]:
import numpy as np

# Load GloVe vectors
glove_vectors = {}
with open('glove/glove.6B.50d.txt', 'r', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], dtype='float32')
        glove_vectors[word] = vector

# Example: Get the vector for the word 'food' from GloVe
glove_food_vector = glove_vectors.get('food')
print(glove_food_vector)


In [None]:
# get cosine similarity
from scipy.spatial import distance

cosine_similarity = 1 - distance.cosine(glove_vectors.get('dinner'), glove_vectors.get('lunch'))
print(f'Cosine similarity: {cosine_similarity}')
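
What the similarity score computes can be written out by hand: the dot product of the two vectors divided by the product of their lengths. The sketch below uses made-up 3-dimensional vectors rather than real GloVe values, and `cosine_similarity_manual` is a hypothetical helper name.

```python
import math

def cosine_similarity_manual(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors for illustration; real embeddings have 50+ dimensions
toy_vectors = {
    "dinner": [0.9, 0.8, 0.1],
    "lunch": [0.8, 0.9, 0.2],
    "car": [0.1, 0.0, 0.9],
}

print(cosine_similarity_manual(toy_vectors["dinner"], toy_vectors["lunch"]))
print(cosine_similarity_manual(toy_vectors["dinner"], toy_vectors["car"]))
```

Vectors pointing in nearly the same direction score close to 1, while unrelated directions score close to 0, regardless of vector length.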



#### 4. One-Hot Encoding

**Description:**
- Represents words as vectors with the same length as the vocabulary.
- Each word is represented by a vector with a 1 in the position corresponding to the word’s index in the vocabulary and 0s elsewhere.

**Uses:**
- Text Classification: Simple representation for short and well-defined text categories, such as document classification.
- Sequence Models: Input for neural network models in sequence-to-sequence tasks like language modeling.

<img src="./images/onehot.png" width="500">
<br>Image Source: medium.com/analytics-vidhya/one-hot-encoding-of-text-data-in-natural-language-processing-2242fefb2148
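
A minimal sketch of one-hot encoding over a four-word vocabulary. The vocabulary and the `one_hot` helper are assumptions for illustration.

```python
# Made-up vocabulary; sorting fixes the position of each word
vocabulary = sorted({"food", "service", "tasty", "slow"})
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's vocabulary index."""
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector

for word in vocabulary:
    print(f"{word:8} {one_hot(word)}")
```

Each vector is as long as the vocabulary, which is why this representation becomes very high-dimensional and sparse on a real corpus.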





#### 5. n-Grams

**Description:**
- Represents text by considering sequences of n consecutive tokens (words or characters) together.
- Can be used to capture context and local word order.


**Uses:**
- Text Generation: Generate text by predicting the next word from the preceding n-1 words.
- Spell Correction: Identify common misspellings by analyzing character n-grams.
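
Extracting n-grams is a sliding window over the token list. The `ngrams` helper below is a hypothetical name, and the example sentence is made up.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["the", "food", "was", "not", "good"]

# Word bigrams keep "not good" together, which single words would lose
print(ngrams(tokens, 2))

# Character trigrams of a single word, as used in spelling-related tasks
print(ngrams(list("tasty"), 3))
```

In scikit-learn, `CountVectorizer(ngram_range=(1, 2))` adds word bigrams to a Bag of Words in the same spirit.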


#### Summary

- **Bag of Words (BoW):** Simple and effective for basic tasks.
- **TF-IDF:** Adds importance weighting to words.
- **Word Embeddings (Word2Vec, GloVe, FastText):** Capture semantic similarity and context.
- **One-Hot Encoding:** Simple but high-dimensional.
- **n-Grams:** Capture local word order.

---

### Applications of these techniques

- Sentiment Analysis
- Topic Modeling
- Named Entity Recognition
- Text Classification