<font color="#de3023"><h1><b>MAKE A COPY OF THIS NOTEBOOK SO YOUR EDITS ARE SAVED</b></h1></font>

# Introduction to AI and Sentiment Analysis with Yelp Reviews

Today, we will develop a machine learning model to determine sentiments expressed in Yelp reviews, classifying them as either positive or negative. This introduces the concept of **sentiment analysis**, a form of natural language processing (NLP) that quantifies individuals' opinions (i.e. **good or bad**) from their textual expressions.


In this notebook, we'll:

1. Explore and manipulate a real Yelp review dataset.
2. Preprocess text data with tokenization and vectorization.
3. Represent words as embeddings using a pre-trained model.
4. Build and train an RNN for sentiment analysis.
5. Evaluate the model's performance on unseen data.


<!-- * **Explore and manipulate data:** Get hands-on experience with the Yelp review dataset created directly from real reviews from Yelp.
* **Preprocess text data:** Learn to convert text into a format suitable for NLP tasks through tokenization and vectorization.
* **Introduction to word embeddings:** Utilize pre-trained models to transform words into numerical representations.
* **Build and train a model:** Implement a recurrent neural network (RNN) to analyze text data and predict sentiments.
* **Evaluate and iterate:** Test the model's performance on unseen data! -->

**Discussion Prompt:** Consider other contexts in which sentiment analysis could be beneficial for businesses or organizations. How might they leverage this technology?

By the end of this notebook, you will be able to build a sentiment analysis classifier, and you will have gained insight into the practical challenges and decisions that come with developing AI models.

Let's get started!

<center> <img src=https://storage.googleapis.com/inspirit-ai-data-bucket-1/Data/AI%20Scholars/Sessions%201%20-%205/Session%203%20-%20NLP/Taco%20Bell%20Reviews.png> </center>

In [None]:
#@title Import our libraries and data (Make sure you use a GPU runtime!)
import pandas as pd   # Great for tables (google spreadsheets, microsoft excel, csv).
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import string
import nltk
import spacy
import wordcloud
import os # Good for navigating your computer's files
import sys
pd.options.mode.chained_assignment = None #suppress warnings

from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from spacy.lang.en.stop_words import STOP_WORDS
nltk.download('wordnet')
nltk.download('punkt')

from wordcloud import WordCloud
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score
import locale
locale.getpreferredencoding = lambda: "UTF-8"
!python -m spacy download en_core_web_md
import en_core_web_md
text_to_nlp = spacy.load('en_core_web_md')

import scipy
# from scipy.spatial.distance import cosine
from sklearn.metrics.pairwise import cosine_similarity

def cosine(word1, word2):

  vector1 = word1.reshape(1, -1)
  vector2 = word2.reshape(1, -1)

  return cosine_similarity(vector1, vector2)[0][0]


# Import our data
!wget -q --show-progress "https://storage.googleapis.com/inspirit-ai-data-bucket-1/Data/AI%20Scholars/Sessions%201%20-%205/Session%203%20-%20NLP/yelp_final.csv"

# 🔍 Data Exploration

First, let's start by loading our review data. The data is stored in a file named `yelp_final.csv`. You can see this file for yourself by clicking the folder icon on the left-hand side of the screen. We will use the `read_csv` function from the pandas library to load the data! Good times.

In [None]:
# read our data in using 'pd.read_csv('file')'
yelp_full = pd.read_csv('yelp_final.csv')
yelp_full.head()

💬 **Discussion:**

- **Output Variable Identification:** Which column in the dataset represents the user's sentiment about the restaurant? Think about how the data in this column could be used as a label (i.e. good or bad) for training our model.

- **Input Variable Identification:** Which column in the dataset represents the user's review about the restaurant?

- **Privacy Considerations:**
   - Notice that the business and user identifiers are not real names but appear as random strings. This technique is known as [hashing](https://medium.com/tech-tales/what-is-hashing-6edba0ebfa67), a common method to ensure privacy.
   - Discuss why you think real names are not included in this dataset. What are the potential risks of using real names in publicly available data?
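
To make the idea of hashing concrete, here is a minimal sketch using Python's standard `hashlib` module. It is only an illustration: we do not know which method was used to generate the identifiers in this dataset, and both the `hash_identifier` helper and the sample name are hypothetical.

```python
import hashlib

def hash_identifier(name):
    # Hypothetical helper: turn a readable name into a fixed-length string.
    # SHA-256 is used here only as one example of a hash function.
    return hashlib.sha256(name.encode("utf-8")).hexdigest()

hashed = hash_identifier("Jane Doe")
print(hashed)                                   # a string of 64 hexadecimal characters
print(hashed == hash_identifier("Jane Doe"))    # the same input always gives the same hash
```

Because the same input always maps to the same output, hashed IDs can still link one user's reviews together without showing the user's name.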

**Next Steps:**
We will keep only the columns necessary for our sentiment analysis. Which are they? Put the column names in the list below!

### Instructor Solution  
<details><summary>click to reveal!</summary>

* **Output Variable Identification:** The 'stars' column
* **Input Variable Identification:** The 'text' column
* **Privacy Considerations:** Even if not explicitly in the dataset, a real name can be used to trace back to sensitive information (address, birth date, etc.), leading to potential harassment, identity theft, employment risks, & reputational damage.

</details>



In [None]:
needed_columns = []  # Replace the empty list with the column names as strings

# Using only the needed columns from the original dataset
yelp = yelp_full[needed_columns]
yelp.head()

In [None]:
#@title Instructor Solution

needed_columns = ["stars", "text"]  # Replace the empty list with the column names as strings

# Using only the needed columns from the original dataset
yelp = yelp_full[needed_columns]
yelp.head()


Currently, our main focus is on the 'text' column, which contains the reviews. These reviews express the users' sentiments and provide insights into how they felt about the businesses. Let's examine a few of these reviews to gain a better understanding of our dataset.

In [None]:
#@title Explore Reviews Based on Star Ratings
#@markdown Use this interactive tool to examine how the content of reviews varies with different star ratings.


# Set the number of stars to select reviews
num_stars =  1 #@param {type:"integer"}

# Print the first 20 reviews that match the selected star rating
print(f"Displaying the first 20 reviews rated with {num_stars} stars:\n")
for review_text in yelp[yelp['stars'] == num_stars]['text'].head(20).values:
    print("\n")
    print(review_text + "\n")
    print("\n")
    print("-"*2000)


💬 **Discussion:**

- **Vocabulary in High Ratings:** What common words or phrases do you find in reviews with high ratings (e.g., 4 or 5 stars)?
  
- **Vocabulary in Low Ratings:** What words or phrases frequently appear in reviews with low ratings (e.g., 1 or 2 stars)?

- **Notable Exceptions:** Can you identify any reviews that contain unexpected phrases or sentiments?

**Consider Further:**

- **Impact of Language on Perception:** How might the language used in a review influence a reader's perception of the business? Discuss the potential consequences for businesses based on the language used in customer reviews.

### Instructor Solution  
<details><summary>click to reveal!</summary>

* **Vocabulary in High Ratings:** Words like "great", "love", "favorite" are often used in highly rated reviews
* **Vocabulary in Low Ratings:** Words like "bad", "rude", "disappointing" are often used in low rated reviews
* **Notable Exceptions:** "The food is good. Better then Dennys but not as good as Mimi's" is another example of an exception
* **Impact of Language on Perception:** Tone, word choice, and other factors can negatively or positively impact a business's reputation, customer trust, and subsequent financial success.

</details>


In [None]:
from collections import Counter

# Set the number of stars to select reviews
num_stars =  5 #@param {type:"integer"}

# Get text for star rating
text_for_star = yelp[yelp['stars'] == num_stars]['text'].values

# Join all reviews into one string (calling str() on the array would cut long output short with '...')
string_text_for_star = " ".join(text_for_star)

# Split the string by words
words_for_star = string_text_for_star.split()
print(f"First 20 words for {num_stars} star reviews: ", words_for_star[:20])

# Count the words; use a new name so the Counter class is not overwritten
word_counts = Counter(words_for_star)

# Get the 50 most common words
most_common_words = word_counts.most_common(50)

print("\n")
print(f"Most common words for {num_stars} star reviews: ", most_common_words)

### 💡 Exercise: Crafting Rules for Sentiment Analysis

Think about the reviews you've looked at. Imagine you're designing a simple system to tell if a review is **positive** or **negative** based only on what words it uses. This is the basis of a rule-based classifier: it uses specific rules you set to make decisions.

For example, reviews containing the word "good" could be positive, while those with the word "bad" might be negative.

As a group, let's come up with a set of rules using combinations of words that might help identify the sentiment of a review. Write down your ideas below!


In [None]:
#@title Define Your Sentiment Analysis Rules

#@markdown Rule 1: Describe a combination of words or a pattern that typically indicates a positive review.
rule_1 = "" #@param {type:"string"}

#@markdown Rule 2: Describe a combination of words or a pattern that typically indicates a negative review.
rule_2 = "" #@param {type:"string"}

#@markdown Rule 3: Enter an additional rule or an exception you observed.
rule_3 = "" #@param {type:"string"}



**Discuss**:

Do you think the rules you've created will perform well in accurately classifying review sentiments? Why or why not?

---

### 💡 Bonus Exercise: Implement and Test Your Sentiment Analysis Rules

Now it's time to put your rules to the test! Write a function that uses one of the rules you developed to determine whether a review is **positive** or **negative**. We've provided the basic structure of the function below. Replace the `pass` statement with your own code to implement your rule.



In [None]:
def classify(text):
    # YOUR CODE HERE
    # Implement your rule to classify the sentiment of the review.
    # You might start with a simple 'if' statement checking for certain words or phrases.
    pass  # Remove 'pass' and replace it with your implementation.

In [None]:
#@title Instructor  Solution
# It checks if the word 'bad' appears in the text.
def classify(text):
    # If the word 'bad' is found, the function returns 'negative'.
    if 'bad' in text:
        return 'negative'
    # If the word 'bad' is not found, it assumes the review is 'positive'.
    else:
        return 'positive'

### More Complex Solution
def classify(text):
    # This function classifies a review as positive (True) or negative (False).
    # It checks for the presence of specific keywords to determine the sentiment.

    # Define lists of positive and negative keywords.
    positive_keywords = ['great', 'amazing', 'loved', 'excellent', 'good', 'wonderful']
    negative_keywords = ['bad', 'worst', 'disappointing', 'poor', 'terrible', 'awful']

    # Lowercase the text so that 'Great' and 'great' are both matched.
    text = text.lower()

    # Count how many of the positive and negative keywords appear in the text.
    positive_count = sum(word in text for word in positive_keywords)
    negative_count = sum(word in text for word in negative_keywords)

    # More positive than negative keywords means positive; ties count as negative.
    return positive_count > negative_count

In [None]:
#@title  🧪  Let's Test Your Sentiment Classification Function
#@markdown Enter your own review text below to see how your function classifies it:

# User inputs their own review text.
input_review = "hey there i love this stuff and it is amazing" #@param {type:"string"}

# Call the classify function with the user input.
if input_review:  # Check if the input string is not empty
    sentiment = classify(input_review)
    print(f"Review: {input_review}")
    print(f"Sentiment: {sentiment}")
else:
    print("Please enter a review text to classify.")


# ⚙️ Processing the Data for Machine Learning

As we transition from manually crafting rules to employing more sophisticated machine learning techniques, we will prepare our data for analysis using a Recurrent Neural Network (RNN). This type of model is particularly effective for processing sequences, such as text, due to its ability to maintain information across inputs!


### 💡 Exercise: Binary Classification of Review Sentiments

Here we want to classify Yelp reviews into two sentiment categories: **positive** and **negative**. To simplify our task into a binary classification problem, we will:

- Label reviews with 4 and 5 stars as 'positive'.
- Label reviews with 1, 2, and 3 stars as 'negative'.

We've already provided the function definition `is_good_review`. Replace the `None` with the correct expression to correctly divide the dataset into two groups, good or bad!

Please complete the function below and run it to create a new `is_good_review` column.

In [None]:
def is_good_review(num_stars):
    # This function categorizes reviews based on the number of stars.
    # It returns True if the review is positive (4 or 5 stars).
    # It returns False if the review is negative (1, 2, or 3 stars).

    # Replace 'None' with the appropriate condition for a positive review.
    if None:  # YOUR CODE HERE
        return True
    else:
        return False

In [None]:
#@title Instructor Solution
def is_good_review(num_stars):
    # This function categorizes reviews based on the number of stars.
    # It returns True if the review is positive (4 or 5 stars).
    # It returns False if the review is negative (1, 2, or 3 stars).

    # Replace 'None' with the appropriate condition for a positive review.
    if num_stars >= 4:  # YOUR CODE HERE
        return True
    else:
        return False

In [None]:
# Apply the function to the 'stars' column to create a new 'is_good_review' column.
# This column will have a Boolean value where True represents a 'good' review and False represents a 'bad' review.
yelp['is_good_review'] = yelp['stars'].apply(is_good_review)

# Display the first few rows to verify the changes.
yelp.head()

In [None]:
#@title Bonus: test your rule based classifier's accuracy on all the reviews
# FIRST: Make sure your classifier returns True or False (for good vs. bad reviews)

# Helper function to show predictions next to the true labels
def show_pred(reviews, y_true, y_pred):
  table = pd.DataFrame({'Text': list(reviews),
                        'Predicted Category': list(y_pred),
                        'True Category': list(y_true)})
  accuracy = (table['Predicted Category'] == table['True Category']).mean()
  print("Accuracy: {:.2%}".format(accuracy))
  return table

reviews = yelp['text']
y_true = yelp["is_good_review"]
# Use classify to make predictions
y_pred = [classify(review) for review in reviews] # a list of predictions
# Display each review with its predicted and true category
show_pred(reviews, y_true, y_pred)

## ✂️ Tokenization

The first step in processing text is **tokenization**, which involves breaking down the text from a single string into individual words, or "tokens." Try it out yourself by entering some text into the cell below to see how it's tokenized into a list of words.

In [None]:
#@title Basic Tokenization Example
#@markdown Enter any text in the field below to see how it is broken down into tokens. Tokenization splits the text into words or symbols.

example_text = "What's the word?" #@param {type:"string"}

# Check if the example text is not empty
if example_text:
    tokens = word_tokenize(example_text)
    print("Tokens:", tokens)
    print("Number of tokens:", len(tokens))
else:
    print("Please enter some text to tokenize.")


### 💬 Discussion: Analyzing Tokenization Rules

1. **Tokenization Patterns:**
   - Observe and discuss the rules the tokenizer follows when splitting text into tokens, especially how it handles punctuation like periods, commas, and hyphens.

2. **Evaluation and Modification:**
   - How does the tokenizer handle punctuation? What would you change about it?

Reflect on these aspects and share your insights!
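
To help with the discussion, here is a small sketch that compares two simple tokenization rules written with Python's standard library. Neither is the rule set used by `word_tokenize`; the two function names and the regular expression are hypothetical choices made for illustration.

```python
import re

def split_on_spaces(text):
    # Simplest possible approach: punctuation stays attached to words.
    return text.split()

def split_words_and_punctuation(text):
    # Hypothetical rule: runs of letters, digits, or apostrophes are tokens,
    # and every other non-space character becomes its own token.
    return re.findall(r"[\w']+|[^\w\s]", text)

example = "Great food, terrible service!"
print(split_on_spaces(example))              # ['Great', 'food,', 'terrible', 'service!']
print(split_words_and_punctuation(example))  # ['Great', 'food', ',', 'terrible', 'service', '!']
```

With the first rule, "food," and "food" would count as different tokens, which is one reason tokenizers usually separate punctuation from words.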


---

## 📚 Understanding Word Embeddings with Word2Vec

When we work with text in machine learning, we can't use words directly. Instead, we need to convert words into **vectors** or **embeddings**. Think of these as lists of numbers that represent each word in a way that a computer can understand and process.

### What is Word2Vec?

**Word2Vec** is a popular method to create these embeddings. It was developed by researchers at Google and has become a standard tool in machine learning for handling text. Word2Vec transforms each word into a dense vector that captures much more information about the word than just its meaning. For instance, words that appear in similar contexts, like "school" and "teacher," will have vectors that are closer together or are more 'similar'.

### How Does Word2Vec Work?

A Word2Vec model is trained on real sentences. It looks at each word together with its neighboring words and learns either to predict a word from its context or to predict the context from the word.
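
The idea of pairing each word with its neighbors can be sketched in a few lines. This is a simplified illustration of how (word, neighbor) training pairs could be collected, not the full Word2Vec training procedure; the `context_pairs` function and its `window` parameter are names we made up for this example.

```python
def context_pairs(tokens, window=2):
    # For each word, pair it with every neighbor at most `window` positions away.
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = ["the", "teacher", "helped", "the", "student"]
for center, neighbor in context_pairs(sentence, window=1):
    print(center, "->", neighbor)
```

A model trained on many such pairs sees "teacher" and "student" next to similar neighbors, which is why their vectors end up close together.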


### Visualizing Word2Vec Embeddings

The link below displays a Word2Vec model's word embeddings plotted in a three-dimensional space. Each dot represents a word, and its position is determined by how similar its meaning is to other words. Words that are similar are clustered together. Don't worry about what the axes represent; all that matters is how close the words are to each other!

[Word2Vec Visualization](https://projector.tensorflow.org/)

**Explore the Visualization:**
- **Feel free to zoom in on the clusters to observe how closely related the words are.**

### Why Use Word2Vec?

Word2Vec allows computers to understand words in a more human-like way, recognizing synonyms, related terms, and even grammatical patterns. This ability is incredibly powerful for tasks like translation, search engines, and of course as it applies to us, sentiment analysis.

## 🔍 Exploring Word Embeddings with spaCy

Having seen how Word2Vec can visualize semantic relationships between words, let's now apply these concepts practically using spaCy!

### Understanding Our Tools

- **Model Overview:** We will use `en_core_web_md`, a medium-sized Word2Vec model provided by spaCy. This model has been trained on a vast corpus of text from the internet, enabling it to understand language by analyzing the contexts in which words appear.
- **Helper Function:** We've provided a function `word2vec(word)` that uses this model to convert any given word into its corresponding word embedding. This representation captures the word's semantic essence based on its usage across millions of sentences.

You can use the function like this: `vec = word2vec(word)`

Let's start by converting words to their embeddings!

In [None]:
#@title Click here to load the word2vec function!
import warnings
warnings.filterwarnings('ignore')

# Load the SpaCy Word2Vec model
# spacy.prefer_gpu()
text_to_nlp = en_core_web_md.load()

# Function to convert a word to its vector representation
def word2vec(word):
    return text_to_nlp(word).vector

### 💡 Exercise: Word Embeddings

Retrieve the word embedding for the word "student" using the provided `word2vec` function. Then, discuss the following:

- Determine the length of the embedding vector for "student." What does this tell you about the nature of word embeddings?
- Compare the length of vectors for different words. Are they consistent across various words?


In [None]:
# Define the word you want to analyze
word = "student"
word2 = "teacher"
# Retrieve the word embedding vector for the word "student"
word_embedding = None # Replace None with the function call to get the vector

# Get the length of the word embedding vector
length_word_embedding = None  # Replace None with code to calculate the length of the vector

# Print the word embedding vector and its length
print("Word Embedding:", word_embedding)
print("Length of Word Embedding:", length_word_embedding)


In [None]:
#@title Instructor Solution
# Define the word you want to analyze
word = "student"
word2 = "teacher"
# Retrieve the word embedding vector for the word "student"
word_embedding = word2vec(word) # Replace None with the function call to get the vector

# Get the length of the word embedding vector
length_word_embedding = len(word_embedding)  # Replace None with code to calculate the length of the vector

# Print the word embedding vector and its length
print("Word Embedding:", word_embedding)
print("Length of Word Embedding:", length_word_embedding)


# ✍ Similarity Using Word Vectors

Word vectors allow us to quantify how similar two words are by comparing their embeddings (i.e. vectors). To measure this similarity, we use the cosine similarity metric, which can be calculated using the function `cosine(vector1, vector2)`.



![](https://storage.googleapis.com/lds-media/images/cosine-similarity-vectors.original.jpg)

### 💡 Exercise

Your task is to explore the similarity between word pairs:

**Compute Similarity:**
   - Write code to compute the similarity score between the words "walk" and "run".

**Experiment with Pairs:**
   - Find pairs of words where:
     - The similarity score is greater than 0.8.
     - The similarity score is less than 0.2.
     - The similarity score is unexpectedly high or low based on your intuition about the words.

Use the cosine similarity function mentioned above to carry out these comparisons, and discuss your findings. Remember, you'll need to transform your words to vectors first!

In [None]:
### YOUR CODE HERE
def similarity(word1, word2):
  pass

similarity('walk', 'run')
### END CODE

In [None]:
#@title Instructor Solution
def similarity(word1, word2):

  vector1 = word2vec(word1)
  vector2 = word2vec(word2)

  return cosine(vector1, vector2)

similarity('walk', 'run')
### END CODE

## 🎓 **[Optional]** Advanced Challenge Exercise: Computing Cosine Similarity

Cosine similarity measures the cosine of the angle between two non-zero vectors. This is used to assess how close two items are. It ranges from -1 (exactly opposite) to 1 (exactly the same), with 0 typically indicating no similarity.

### 📐 Cosine Similarity Formula

The cosine similarity between two vectors $ \mathbf{A} $ and $ \mathbf{B} $ is calculated as follows:

$$ \text{Cosine Similarity} = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|} $$

Where:
- $ \mathbf{A} \cdot \mathbf{B} $ is the dot product of the vectors,
- $ \|\mathbf{A}\| $ and $ \|\mathbf{B}\| $ are the norms (or magnitudes) of the vectors. Really, this is just another fancy way of saying "length".
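
Before writing the NumPy version, it may help to see the formula at work on tiny vectors. The sketch below uses only plain Python, and `cosine_from_formula` is a hypothetical name chosen for this illustration.

```python
import math

def cosine_from_formula(a, b):
    # Dot product divided by the product of the two lengths.
    dot = sum(x * y for x, y in zip(a, b))
    length_a = math.sqrt(sum(x * x for x in a))
    length_b = math.sqrt(sum(y * y for y in b))
    return dot / (length_a * length_b)

print(cosine_from_formula([1, 0], [2, 0]))    # same direction: 1.0
print(cosine_from_formula([1, 0], [0, 3]))    # perpendicular: 0.0
print(cosine_from_formula([1, 0], [-4, 0]))   # opposite direction: -1.0
```

Notice that the lengths of the vectors do not change the result; only the angle between them matters.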

To successfully implement this, here are some helpful hints regarding the functions and libraries you might need:



#### Functions and Methods to Use:
1. **`np.dot()` or `@` operator:** Use this to compute the dot product of two vectors. This function takes two arrays and returns their dot product.
   
   ```python
   dot_product = np.dot(vector1, vector2)
   # or
   dot_product = vector1 @ vector2
   ```

2. **`np.linalg.norm()`:** This function computes the norm (magnitude) of a vector. You'll need to calculate the norm for both vectors involved in the cosine similarity.

   ```python
   norm_vector = np.linalg.norm(vector1)
   ```
   
Use these functions to calculate the cosine similarity according to the formula:

$$ \text{Cosine Similarity} = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|} $$

In [None]:
#Your code here! Assume the vectors are numpy arrays already!
def my_cosine_similarity(vec1, vec2):
    dot_product = None #Fill me in
    norm_vec1 = None   #Fill me in
    norm_vec2 = None   #Fill me in
    similarity = None  #Fill me in
    return similarity


In [None]:
#@title Instructor Solution

def my_cosine_similarity(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    similarity = dot_product / (norm_vec1 * norm_vec2)
    return similarity

In [None]:
#@title Run this to check if your function is correct!

# Example vectors
vector1 = np.array([1, 2, 3])
vector2 = np.array([1, 5, 7])

# Compute the cosine similarity
similarity_score = my_cosine_similarity(vector1, vector2)
print("Cosine Similarity:", similarity_score)
print("Correct answer: 0.9875414397573881")

## 🔄 Putting it all together: Preparing Your Data for the Model

Having explored tokenization and word embeddings, we're now ready to apply these concepts to prepare our data for the model! This step is crucial as it transforms raw text into a structured format that our machine learning algorithms can understand and learn from.

### 💡 Exercise
Identify which columns in the `yelp` dataframe should be used as your X (inputs/features) and which should be your y (outputs/labels). Complete the following code to specify these columns:


In [None]:
# Specify the input features (X) and output labels (y) from the 'yelp' dataframe
X_columns = "" # Replace with the name of the column to be used as input features
y_column = ""  # Replace with the name of the column to be used as output labels


X_text = yelp[X_columns]
y = yelp[y_column]

In [None]:
#@title Instructor Solution
# Specify the input features (X) and output labels (y) from the 'yelp' dataframe
X_columns = "text" # Replace with the names of columns to be used as input features
y_column = "is_good_review"  # Replace with the name of the column to be used as output labels


X_text = yelp[X_columns]
y = yelp[y_column]

## 💡 Exercise: Preparing Text Data for Machine Learning

In this exercise, you'll use some functions we wrote to prepare text data for a machine learning model. This involves tokenizing the text, converting it into numerical embeddings, and making sure each text entry is the same length before feeding it into the model!

1. **Load and Tokenize Text:**
   - Use the function `tokenize_and_embed(text_data)` to process your text data in batches. This function handles the tokenization and conversion of text into embeddings for you.

2. **Standardize Text Length:**
   - Apply `standardize_length(embeddings)` to ensure that all text data have the same number of features. This function finds the longest text and pads the others accordingly.

3. **Convert to Machine Learning Format:**
   - Finally, use `convert_to_array(padded_embeddings)` to transform your standardized text data into a format suitable for machine learning models.
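
To see what padding does, here is a toy sketch in which single numbers stand in for whole word embeddings. The `pad_front` function and the sample values are made up for illustration; it adds padding at the front of each sequence, the same side the `standardize_length` helper pads on.

```python
def pad_front(sequences, pad_value=0.0):
    # Find the longest sequence, then add pad_value to the front of shorter ones.
    max_length = max(len(seq) for seq in sequences)
    return [[pad_value] * (max_length - len(seq)) + seq for seq in sequences]

toy_reviews = [[0.5, 0.1, 0.9], [0.3], [0.7, 0.2]]
padded = pad_front(toy_reviews)
for row in padded:
    print(row)
# [0.5, 0.1, 0.9]
# [0.0, 0.0, 0.3]
# [0.0, 0.7, 0.2]
```

After padding, every row has the same length, so the rows can be stacked into one rectangular array.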

<!--

**Disclaimer: Simplifying Text Processing for Educational Purposes**

In this educational module, we are focusing on higher-level concepts and applications of machine learning rather than delving deeply into every preprocessing step. One such step we are simplifying is the process of standardizing the length of text entries before they are fed into a machine learning model.

Ensuring that each text entry is the same length is crucial for many machine learning algorithms, especially neural networks, as they require fixed-size inputs. This process, often achieved through padding shorter texts with zeros, can involve intricate choices about text truncation, padding strategies, and the handling of embeddings.

However, to keep our focus on the broader application of natural language processing (NLP) techniques and to ensure that students are not overwhelmed by the complexity of data preprocessing, we are using predefined functions to manage this step. This approach allows students to concentrate on understanding how machine learning models operate on text data and the impact of NLP in real-world applications, rather than getting bogged down in the details of text length standardization.

By abstracting away these details, we aim to make the learning experience more accessible and engaging, allowing students to build a foundational understanding before tackling more complex aspects of NLP and AI. -->

In [None]:
#@title Run This to Load Our Functions!

def tokenize_and_embed(text_data):
    """
    Tokenizes the text data and converts it to word embeddings using SpaCy.
    Args:
        text_data (list): A list of text strings to be processed.
    Returns:
        list: A list of lists containing embeddings for each token in each document.
    """
    docs = list(text_to_nlp.pipe(text_data))
    embeddings = [[token.vector for token in doc] for doc in docs]
    return embeddings

def standardize_length(embeddings):
    """
    Ensures all embedding lists are the same length by padding shorter ones with zero vectors.
    Args:
        embeddings (list): A list of lists of embeddings.
    Returns:
        list: A list of lists with padded embeddings to ensure uniform length.
    """
    max_length = max(len(tokens) for tokens in embeddings)
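    # Take the embedding size from the first non-empty document (the first one may have no tokens).
    embedding_dim = next((len(tokens[0]) for tokens in embeddings if tokens), 0)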
    padded_embeddings = [[np.zeros(embedding_dim)] * (max_length - len(tokens)) + tokens for tokens in embeddings]
    return padded_embeddings

def convert_to_array(padded_embeddings):
    """
    Converts a list of padded embeddings into a numpy array.
    Args:
        padded_embeddings (list): A list of lists of padded embeddings.
    Returns:
        numpy.ndarray: A numpy array containing the embeddings suitable for machine learning input.
    """
    return np.array(padded_embeddings)

In [None]:
# X_text contains our text data to be processed
X_embeddings = None  # Tokenize and get embeddings

X_padded = None  # Standardize lengths

X = None  # Convert to numpy array suitable for model input

if X is not None: print(f"The shape of our dataset is now: {X.shape}")

In [None]:
#@title Instructor Solution
# X_text contains our text data to be processed
X_embeddings = tokenize_and_embed(X_text)  # Tokenize and get embeddings

X_padded = standardize_length(X_embeddings)  # Standardize lengths

X = convert_to_array(X_padded)  # Convert to numpy array suitable for model input

print(f"The shape of our dataset is now: {X.shape}")

## Exercise: Splitting Data into Training and Testing Sets

Now that you have processed your reviews, the next step is to divide this data into training and testing sets.

Use the `train_test_split()` function to create training and testing datasets.

```python
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=.2, random_state=1)
```

- **test_size**: This parameter controls the proportion of the data that will be split into the testing set.
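
To build intuition for what a split does, here is a hand-rolled sketch using only Python's standard library. It is not how `train_test_split` is implemented, and `simple_split` is a hypothetical helper; it only shows the idea of shuffling the data and setting a fraction aside.

```python
import random

def simple_split(items, test_size=0.2, seed=1):
    # Shuffle a copy so the original order is untouched, then cut it in two.
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

train, test = simple_split(range(10), test_size=0.2)
print(len(train), len(test))  # 8 2
```

With `test_size=0.2`, two of the ten items are held out for testing and the other eight are used for training. The real function also keeps each review matched with its label while shuffling.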

<!--
- **random_state**: Setting this parameter ensures that the split is reproducible. -->

In [None]:
#Fill in None!
X_train, X_test, y_train, y_test = train_test_split(None, None, test_size=.2, random_state=1)

In [None]:
#@title Instructor Solution
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=1)


# Recurrent Neural Networks (RNNs)

We are now going to explore a special type of neural network called a Recurrent Neural Network (RNN). While we have briefly touched on other types of neural networks before, RNNs are unique because they can process sequences of data in order. This makes them particularly useful for tasks where the sequence or the order of data points is important.


**How Do RNNs Work?**

<img src="https://stanford.edu/~shervine/teaching/cs-230/illustrations/rnn-many-to-many-different-ltr.png?8ca8bafd1eeac4e8c961d9293858407b" width="500">


Unlike traditional neural networks, which treat each input independently, RNNs have loops in them that allow information to persist. In simpler terms, RNNs can remember information about what has been processed so far, enabling them to make predictions based on the sequence of data received.
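
The "loop" can be sketched with a single number as the memory. This is a deliberately tiny illustration with made-up weights, not the LSTM model used later in this notebook; `run_tiny_rnn`, `w_input`, and `w_hidden` are hypothetical names.

```python
import math

def run_tiny_rnn(inputs, w_input=1.0, w_hidden=0.5):
    # The hidden state starts at zero and is updated once per input.
    hidden = 0.0
    for x in inputs:
        # The new state depends on the current input AND the previous state.
        hidden = math.tanh(w_input * x + w_hidden * hidden)
    return hidden

print(run_tiny_rnn([1.0, 0.0]))
print(run_tiny_rnn([0.0, 1.0]))
```

Both sequences contain the same two numbers, but the final hidden state differs because the order is different. That sensitivity to order is what makes RNNs useful for text.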

**Examples of RNN Applications:**

- **Stock Price Prediction:** RNNs can predict future stock prices by learning from past stock price trends.
- **Language Modeling:** They can predict the next word in a sentence based on the words that came before, which is useful in text auto-completion tools.
- **Weather Forecasting:** RNNs can predict future weather conditions by analyzing the patterns in past weather data.

RNNs are well-suited to many sequence-based tasks in fields like finance, natural language processing, and meteorology!


# Exercise

We've built the RNN model for you! All you need to do is train it using the `.fit()` method.

In [None]:
#@title Run this to load the RNN Model!
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.regularizers import l2
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.metrics import accuracy_score  # used by RNNClassifier.score

class RNNClassifier:
    def __init__(self, num_epochs=30, lstm_units=50, dropout_rate=0.7):
        self.num_epochs = num_epochs
        self.lstm_units = lstm_units
        self.dropout_rate = dropout_rate
        self.model = self.build_model()

    def build_model(self):
        model = Sequential()
        model.add(LSTM(self.lstm_units, return_sequences=True))
        model.add(Dropout(self.dropout_rate))
        model.add(LSTM(self.lstm_units))
        model.add(Dropout(self.dropout_rate))
        model.add(Dense(1, activation='sigmoid'))
        optimizer = Adam(learning_rate=0.001)
        model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
        return model

    def fit(self, X_train, y_train, X_val=None, y_val=None, **kwargs):
        """
        Fits the model to the training data. Supports optional validation data.
        If validation data is provided, early stopping is used; otherwise it is not.

        Args:
            X_train (array): Training data features.
            y_train (array): Training data labels.
            X_val (array, optional): Validation data features.
            y_val (array, optional): Validation data labels.
            **kwargs: Additional keyword arguments to pass to the model's fit method.

        Returns:
            A history object containing training history, or None if the
            training data is missing.
        """

        # Catch the case where either placeholder was left as None
        if X_train is None or y_train is None:
            print("Arguments are None. Retry with correct arguments.")
            return None

        # Copy the list so that a callbacks list passed in by the caller is not modified
        callbacks = list(kwargs.pop('callbacks', None) or [])

        if X_val is not None and y_val is not None:
            early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
            callbacks.append(early_stopping)
            return self.model.fit(X_train, y_train, epochs=self.num_epochs, validation_data=(X_val, y_val), callbacks=callbacks, batch_size=32, verbose=1, **kwargs)
        else:
            return self.model.fit(X_train, y_train, epochs=self.num_epochs, batch_size=32, verbose=1, callbacks=callbacks, **kwargs)

    def predict(self, *args, **kwargs):
        predictions = self.model.predict(*args, **kwargs)
        return (predictions > 0.5).astype(int)

    def predict_proba(self, *args, **kwargs):
        return self.model.predict(*args, **kwargs)

    def score(self, X, y):
        predictions = self.predict(X)
        return accuracy_score(y, predictions)

    def __getattr__(self, name):
        if name != 'predict' and name != 'predict_proba':
            return getattr(self.model, name)
        else:
            raise AttributeError(f"'{self.__class__.__name__}' object has no attribute '{name}'")

In [None]:
#YOUR CODE HERE! Replace the nones!
rnn = RNNClassifier(num_epochs=30, lstm_units=50, dropout_rate=0.5)
rnn.fit(None, None)

In [None]:
#@title Instructor Solution
rnn = RNNClassifier(num_epochs=30, lstm_units=50, dropout_rate=0.5)
rnn.fit(X_train, y_train)

### ✍ Exercise: Testing Your Model
Now, let's evaluate our model's accuracy! Your model needs to **predict** the sentiment, and then you'll **calculate the accuracy** using the `accuracy_score()` function. **Which dataset** should you use?

In [None]:
y_pred = None  # YOUR CODE HERE
accuracy = None  # YOUR CODE HERE
print(accuracy)

In [None]:
#@title Instructor Solution
y_pred = rnn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)  # true labels first, then predictions
print(accuracy)

Congratulations - you've trained and tested your model! It's not perfect, but a whole lot better than a coin flip :)
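
For intuition, accuracy is just the fraction of predictions that match the true labels. The sketch below uses made-up probabilities and labels (not output from your model) to show the two steps: thresholding at 0.5, as `RNNClassifier.predict` does, and then comparing against the labels. Note the `.ravel()`: the model returns one column per example, so flattening first avoids a shape mismatch when comparing by hand with a 1-D label array.

```python
import numpy as np

# Made-up predicted probabilities, shaped (n_examples, 1) like a sigmoid output layer
toy_probabilities = np.array([[0.91], [0.20], [0.65], [0.40], [0.08]])
toy_true_labels = np.array([1, 0, 0, 0, 1])

# Step 1: turn probabilities into 0/1 predictions using a 0.5 threshold
toy_predictions = (toy_probabilities > 0.5).astype(int).ravel()

# Step 2: accuracy = number of correct predictions / total number of predictions
toy_accuracy = np.mean(toy_predictions == toy_true_labels)

print(toy_predictions)  # [1 0 1 0 0]
print(toy_accuracy)     # 0.6
```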


### ✍ Exercise: Trying Out Reviews

Accuracy only tells us so much! It's often useful to figure out **what sorts** of mistakes your model makes.

Try entering some reviews below and explore:

*   What kind of reviews does your model classify correctly? For example, do long or short reviews work better?
*   What kind of reviews does your model get wrong? Does it understand sarcasm or other "tricky" language?
*   Does it seem like your model pays attention to particular words?



In [None]:
#@title Enter a review to see your model's classification
example_review = "This was a horrible place!" #@param {type:'string'}

# This cell assumes tokenize_and_embed, standardize_length, and convert_to_array were defined earlier in the notebook
# First, wrap the example review in a list since our functions expect a list of texts
example_reviews = [example_review]

# Tokenize and convert the review text to embeddings
example_embeddings = tokenize_and_embed(example_reviews)

# Standardize lengths of the embeddings
example_padded = standardize_length(example_embeddings)

# Convert the padded embeddings into a numpy array suitable for the model
# (named X_example so we don't overwrite the full dataset X)
X_example = convert_to_array(example_padded)

prediction = rnn.predict(X_example)
if prediction[0]:
  print("This was a GOOD review!")
else:
  print("This was a BAD review!")



# Exploring Impact and Ethics



Whenever we explore a new potential use of AI, it is crucial to have a discussion about the **societal and ethical impact** if it were to be implemented at a large scale.

*Illustration: erhui1979/iStock*

<center> <img src="https://storage.googleapis.com/inspirit-ai-data-bucket-1/Data/AI%20Scholars/Sessions%201%20-%205/Session%203%20-%20NLP/AI%20Ethics.png"> </center>

### 📈 Who might this AI impact?

An important part of incorporating AI into a business is discussing how it would impact all areas of that business.
Let's come up with 3 groups of people that would be impacted by an AI that can classify reviews as positive or negative. We will call these groups `stakeholders`.



In [None]:
stakeholder_1 = '' #@param {type:"string"}
stakeholder_2 = '' #@param {type:"string"}
stakeholder_3 = '' #@param {type:"string"}




*   **Discuss**: For each of those stakeholders, what are some benefits of this AI model? What are some drawbacks?


> *Hint: What do each of those stakeholders care about?*



* **Discuss**: What are some societal outcomes that could occur if we had a lot of **false positives** (negative reviews misclassified as positive reviews)? How about **false negatives** (positive reviews misclassified as negative reviews)?

*   **Discuss**: What are some potential sources of bias?

*   **Discuss**: What are some other ethical questions you can come up with?
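
To ground the false positive / false negative discussion, here is a small sketch that counts each kind of mistake. The labels below are invented for illustration (1 = positive review, 0 = negative review); they are not results from the model in this notebook.

```python
import numpy as np

toy_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])  # what each review really was
toy_pred = np.array([1, 1, 0, 0, 0, 1, 1, 0])  # what a hypothetical model predicted

# False positive: the review was actually negative (0) but was predicted positive (1)
false_positives = int(np.sum((toy_true == 0) & (toy_pred == 1)))

# False negative: the review was actually positive (1) but was predicted negative (0)
false_negatives = int(np.sum((toy_true == 1) & (toy_pred == 0)))

print(false_positives, false_negatives)  # 2 1
```

Two models with the same accuracy can make very different mixes of these two mistakes, which is one reason accuracy alone may not settle the ethical questions above.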






# Optional Advanced Challenge: Linear Algebra and Embeddings

(Heads-up: this challenge section is math-heavy!)

One reason text embeddings are cool is that we can use them to explore connections in meaning between different words, including calculating similarity between words and completing [analogies](http://epsilon-it.utu.fi/wv_demo/).

To get started, we'll first create a vocabulary of the most common words from our Yelp reviews dataset. We'll take a Bag of Words (BOW) style approach, using Python's `Counter` to count how often each word appears. From this, we'll select the most frequently used words (the top 500 by default) to form our vocabulary.

Next, we'll create a dictionary containing the vectors for all the words in our vocabulary. This dictionary will help us analyze the relationships between words. The vocabulary size is controlled by the `top_n` argument, so feel free to change that number!
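
The counting step can be sketched with Python's built-in `Counter`. The token list here is made up for illustration; the vocabulary builder applies the same idea to the full set of reviews.

```python
from collections import Counter

# A made-up list of tokens standing in for tokenized reviews
toy_tokens = ["food", "great", "food", "service", "great", "food", "slow"]

word_counts = Counter(toy_tokens)

# most_common(n) returns the n most frequent (word, count) pairs, highest count first
top_two = word_counts.most_common(2)
print(top_two)  # [('food', 3), ('great', 2)]
```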

In [None]:
#@title Run this to define our vocabulary builder!
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords
from collections import Counter

def build_vocab_dict(texts, top_n=500):
    """
    Builds a dictionary of the most common words and their embeddings using SpaCy.

    Args:
        texts (list of str): The list of texts from which to build the vocabulary.
        top_n (int): The number of top words to include in the vocabulary.

    Returns:
        dict: A dictionary mapping words to their embeddings.
    """
    # Tokenize the text and lower case each word
    tokens = [word.lower() for text in texts for word in word_tokenize(text)]

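    # Remove stopwords and non-alphabetic tokens
    # (build the stopword set once; calling stopwords.words() for every token is very slow)
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [token for token in tokens if token.isalpha() and token not in stop_words]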

    # Count the occurrences of each word
    word_counts = Counter(filtered_tokens)

    # Select the top 'top_n' most common words
    most_common_words = [word for word, count in word_counts.most_common(top_n)]

    # Create a dictionary for the most common words and their embeddings
    vocab_dict = {}
    for word in most_common_words:
        token = text_to_nlp.vocab[word]
        if token.has_vector:  # Check if the token has a vector in the model's vocabulary
            vocab_dict[word] = token.vector
        else:
            # Handle out-of-vocabulary words by assigning a zero vector
            embedding_dim = text_to_nlp.vocab.vectors_length
            vocab_dict[word] = np.zeros((embedding_dim,))

    return vocab_dict

# Example usage:
# X_text_example = ["This is the first document.", "This document is the second document.", "And this is the third one."]
# vocab_dict = build_vocab_dict(X_text_example)
# print(vocab_dict)


In [None]:
vocab_dict = build_vocab_dict(X_text, top_n = 800)

for word, vec in vocab_dict.items():
  print(word)

print(f'{len(vocab_dict)} words in our dictionary')

### Cosine Similarity
Next, let's calculate the similarity between two words, using their Word2Vec representations. As before, we'll use cosine similarity to measure the similarity between our vectors.

As an example, imagine we had two three-dimensional vectors:

In [None]:
v0 = [2,3,1]
v1 = [2,4,1]

Run the code below to plot those vectors, and try changing the numbers above.
How can you make a very small angle between the vectors? How can you make a very large angle?
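
If you'd like a number to go with the picture, the sketch below computes the cosine similarity and the angle between two example vectors using NumPy. It uses its own copies of the vectors (named `u` and `w` here so that nothing else in the notebook is overwritten), and it is only one possible way to write this.

```python
import numpy as np

u = np.array([2, 3, 1])
w = np.array([2, 4, 1])

# Cosine similarity: dot product divided by the product of the vector lengths
cos_sim = np.dot(u, w) / (np.linalg.norm(u) * np.linalg.norm(w))

# Convert to an angle in degrees (clip guards against tiny rounding errors)
angle_degrees = np.degrees(np.arccos(np.clip(cos_sim, -1.0, 1.0)))

print(round(float(cos_sim), 3), round(float(angle_degrees), 1))
```

A cosine similarity close to 1 corresponds to a small angle; try changing the numbers in `u` and `w` to see how the angle responds.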

In [None]:
#@title Run this to create an interactive 3D plot
#NOTE: Would be extra cool with sliders for the vector coordinates! - DREW
#Code from https://stackoverflow.com/questions/47319238/python-plot-3d-vectors
import numpy as np
import plotly.graph_objs as go

def vector_plot(tvects,is_vect=True,orig=[0,0,0]):
    """Plot vectors using plotly"""

    if is_vect:
        if not hasattr(orig[0],"__iter__"):
            coords = [[orig,np.sum([orig,v],axis=0)] for v in tvects]
        else:
            coords = [[o,np.sum([o,v],axis=0)] for o,v in zip(orig,tvects)]
    else:
        coords = tvects

    data = []
    for i,c in enumerate(coords):
        X1, Y1, Z1 = zip(c[0])
        X2, Y2, Z2 = zip(c[1])
        vector = go.Scatter3d(x = [X1[0],X2[0]],
                              y = [Y1[0],Y2[0]],
                              z = [Z1[0],Z2[0]],
                              marker = dict(size = [0,5],
                                            color = ['blue'],
                                            line=dict(width=5,
                                                      color='DarkSlateGrey')),
                              name = 'Vector'+str(i+1))
        data.append(vector)

    layout = go.Layout(
             margin = dict(l = 4,
                           r = 4,
                           b = 4,
                           t = 4)
                  )
    fig = go.Figure(data=data,layout=layout)
    fig.show()


vector_plot([v0,v1])

## Exercise: Identifying Similar Words Using Your Cosine Similarity Function

In this exercise, you will apply your own implementation of cosine similarity to find the most similar word to a given target word in a vocabulary. You’ll be using the `my_cosine_similarity` function that you wrote earlier, leveraging it to compare word vectors and identify the closest matches.

### What You'll Do

Write a function named `find_most_similar` that utilizes your `my_cosine_similarity` function to determine which word in a predefined vocabulary is most similar to a specified target word. The function should return both the most similar word and its similarity score!

### Some Guidelines

1. **Check Vocabulary**: Initially, ensure the target word is present in the vocabulary. If it’s not, the function should notify the user and not proceed with calculations.
2. **Calculate Similarity**: Use your `my_cosine_similarity` function to compute the similarity between the target word's vector and each vector in the vocabulary.
3. **Track the Highest Score**: As you compute similarities, keep track of the word with the highest similarity score.
4. **Return Results**: After checking all words, return the word with the highest similarity score and the score itself.

Here's an example of how your code will be used!

```python
similar_word, similarity_score = find_most_similar('burger')
if similar_word is not None:
    print(f"The most similar word to 'burger' is '{similar_word}' with a similarity score of {similarity_score:.2f}.")
```


In [None]:
def find_most_similar(target_word):
    # Check if the target word is in the vocabulary dictionary
    if target_word not in vocab_dict:
        print("Word not in dictionary")
        return None, None

    # Retrieve the vector for the target word from the vocabulary dictionary
    vec1 = vocab_dict[target_word]

    # Initialize variables to keep track of the most similar word and the highest similarity score
    most_similar_word = None
    highest_similarity = -np.inf  # Start with the lowest possible similarity

    # Iterate over each word and its vector in the vocabulary dictionary
    for word, vec2 in vocab_dict.items():
        # YOUR CODE HERE: Calculate the similarity using the my_cosine_similarity function
        # Make sure to remove the continue
        continue
    # Return the most similar word along with the similarity score
    return most_similar_word, highest_similarity

In [None]:
#@title Instructor Solution
def find_most_similar(target_word):
    # Check if the target word is in the vocabulary dictionary
    if target_word not in vocab_dict:
        print("Word not in dictionary")
        return None, None

    # Retrieve the vector for the target word from the vocabulary dictionary
    vec1 = vocab_dict[target_word]

    # Initialize variables to keep track of the most similar word and the highest similarity score
    most_similar_word = None
    highest_similarity = -np.inf  # Start with the lowest possible similarity

    # Iterate over each word and its vector in the vocabulary dictionary
    for word, vec2 in vocab_dict.items():
        if word == target_word:
            continue  # Skip comparison with the target word itself

        # Calculate the similarity using the my_cosine_similarity function
        similarity = my_cosine_similarity(vec1, vec2)

        # Update the most similar word and the highest similarity score
        if similarity > highest_similarity:
            highest_similarity = similarity
            most_similar_word = word

    # Return the most similar word along with the similarity score
    return most_similar_word, highest_similarity

### Let's test your function below!

In [None]:
word = "eat" #@param {type:'string'}

similar_word, similarity_score = find_most_similar(word)
if similar_word is not None:
    print(f"The most similar word to '{word}' is '{similar_word}' with a similarity score of {similarity_score:.2f}.")

## Using Word Analogies

We can use the functions we've built to complete word analogies, similar to the examples found [here](http://epsilon-it.utu.fi/wv_demo/). For instance, consider the analogy:

- Breakfast is to bagel as lunch is to ________.

This involves a bit of "word arithmetic". Suppose $A_1$, $A_2$, and $B_1$ are vectors representing three known words. Our task is to find $B_2$ to complete the analogy:

- $A_1$ is to $A_2$ as $B_1$ is to $B_2$.

Intuitively, this implies that the vector difference between $A_1$ and $A_2$ should be the same as the vector difference between $B_1$ and $B_2$. Thus, we can express this relationship mathematically as:

- $A_1 - A_2 = B_1 - B_2$

### Solving for $B_2$:

To find $B_2$, we rearrange the above equation:

- $B_2 = B_1 - (A_1 - A_2)$

This formulation allows us to compute the expected vector for $B_2$ directly by using vector arithmetic. Once we have the vector for $B_2$, we can use our previously developed functions to identify the word whose vector representation is closest to this computed vector. Try it out and explore different analogies!
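
Here is a tiny worked example of that arithmetic. The 2-D vectors below are invented (and chosen so the analogy works out exactly); real embeddings have many more dimensions and rarely line up this neatly.

```python
import numpy as np

# Made-up 2-D "embeddings" for illustration only
toy_vectors = {
    "breakfast": np.array([1.0, 0.0]),
    "bagel":     np.array([1.0, 2.0]),
    "lunch":     np.array([3.0, 0.0]),
    "sandwich":  np.array([3.0, 2.0]),
}

a1 = toy_vectors["breakfast"]
a2 = toy_vectors["bagel"]
b1 = toy_vectors["lunch"]

# B2 = B1 - (A1 - A2)
expected_b2 = b1 - (a1 - a2)
print(expected_b2)  # [3. 2.]

# In this toy setup, the computed vector lands exactly on "sandwich"
print(np.allclose(expected_b2, toy_vectors["sandwich"]))  # True
```

With real embeddings the computed vector will almost never match a word exactly, which is why the exercise searches for the closest word instead.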


In [None]:
# Complete the function below!
def find_analogy(word_a1, word_a2, word_b1):
    # Retrieve vectors for each word
    # Use the word2vec function to get the vector for each word
    a1 = word2vec(word_a1)
    a2 = word2vec(word_a2)
    b1 = word2vec(word_b1)

    # Check if any vectors are None (word not in vocabulary)
    # If any of the words are not in the vocabulary, print a message and return (None, None)
    # so that the caller can still unpack two values
    if a1 is None or a2 is None or b1 is None:
        missing = [word for word, vec in zip([word_a1, word_a2, word_b1], [a1, a2, b1]) if vec is None]
        print(f"Missing vector for: {', '.join(missing)}")
        return None, None

    # Calculate the expected vector for b2 based on the analogy
    # The analogy is: word_a1 is to word_a2 as word_b1 is to what word?
    # Calculate vec1 by subtracting the difference between a1 and a2 from b1
    vec1 = b1 - (a1 - a2)

    # Initialize variables to keep track of the most similar word and the highest similarity score
    most_similar_word = None
    highest_similarity = -np.inf  # Start with the lowest possible similarity

    # Iterate over each word and its vector in the vocabulary dictionary
    # vocab_dict is a dictionary where keys are words and values are their vectors
    for word, vec2 in vocab_dict.items():
        # Skip the current word_b1 to avoid trivial matches

        # Calculate the similarity using the my_cosine_similarity function
        # Your code to calculate similarity goes here

        # Update the most similar word and the highest similarity score if the current word is more similar
        # Your code to update most_similar_word and highest_similarity goes here
        pass  # Placeholder so the empty loop runs; remove it once you add your code

    # Return the most similar word along with the similarity score
    return most_similar_word, highest_similarity


In [None]:
#@title Instructor Solution

def find_analogy(word_a1, word_a2, word_b1):

    # Retrieve vectors for each word
    a1 = word2vec(word_a1)
    a2 = word2vec(word_a2)
    b1 = word2vec(word_b1)

    # Check if any vectors are None (word not in vocabulary)

    if a1 is None or a2 is None or b1 is None:
        missing = [word for word, vec in zip([word_a1, word_a2, word_b1], [a1, a2, b1]) if vec is None]
        print(f"Missing vector for: {', '.join(missing)}")
        return None, None  # Two values, so the caller can unpack the result

    # Calculate the expected vector for b2 based on the analogy
    vec1 = b1 - (a1 - a2)

    # Initialize variables to keep track of the most similar word and the highest similarity score
    most_similar_word = None
    highest_similarity = -np.inf  # Start with the lowest possible similarity

    # Iterate over each word and its vector in the vocabulary dictionary
    for word, vec2 in vocab_dict.items():
        if word == word_b1:
            continue  # Skip word_b1 itself to avoid a trivial match
        # Calculate the similarity using the my_cosine_similarity function
        similarity = my_cosine_similarity(vec1, vec2)

        # Update the most similar word and the highest similarity score
        if similarity > highest_similarity:
            highest_similarity = similarity
            most_similar_word = word

    # Return the most similar word along with the similarity score
    return most_similar_word, highest_similarity

### Let's test your function to see how it does!

In [None]:
worda1 = "cars" #@param {type:'string'}
worda2 = "wheels" #@param {type:'string'}
wordb1 = "birds" #@param {type:'string'}

similar_word, similarity_score = find_analogy(worda1, worda2, wordb1)
if similar_word is not None:
    print(f"The word analogous to '{wordb1}' in the context of '{worda1}' to '{worda2}' is '{similar_word}', with a similarity score of {similarity_score:.2f}.")
else:
    print(f"No analogous word found for '{wordb1}' in the context of '{worda1}' to '{worda2}'.")


Word arithmetic doesn't always work perfectly - it's pretty tricky to find good examples! Which ones can you discover?

If you're looking for a way to expand further on this exercise, you can try seeing what happens when you use [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance), another common measurement, instead of cosine similarity.
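
As a starting point for that comparison, here is a hedged sketch with made-up vectors. It shows one way the two measurements can disagree: cosine similarity ignores vector length, whereas Euclidean distance does not. The helper names `euclidean_distance` and `cosine_sim` are hypothetical and are not defined elsewhere in this notebook.

```python
import numpy as np

def euclidean_distance(u, v):
    """Straight-line distance between two vectors (smaller means more similar)."""
    return np.linalg.norm(np.asarray(u) - np.asarray(v))

def cosine_sim(u, v):
    """Cosine of the angle between two vectors (larger means more similar)."""
    u, v = np.asarray(u), np.asarray(v)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

target = np.array([1.0, 1.0])
same_direction_but_long = np.array([10.0, 10.0])
nearby_but_different_angle = np.array([1.0, 2.0])

# Cosine similarity prefers the vector pointing in the same direction...
print(cosine_sim(target, same_direction_but_long), cosine_sim(target, nearby_but_different_angle))

# ...while Euclidean distance prefers the vector that is physically closer
print(euclidean_distance(target, same_direction_but_long), euclidean_distance(target, nearby_but_different_angle))
```

If you swap Euclidean distance into your own functions, remember to flip the comparison: you would look for the smallest distance rather than the largest similarity.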