# Bag of Words Project

In this section, we are going to build a bag of words from a messier, more real-world dataset. We will take some Tesla customer reviews and create a bag of words from them using the techniques we learned earlier. While we could use a much larger dataset across all car brands, a single brand is sufficient for our purposes. The main objective here is to learn how to handle multiple documents and integrate what we have learned so far. 

This `tesla_car_reviews.csv` dataset is a CSV, so it will be easiest to open it first with pandas. We will also set some configurations to show all rows and not truncate them. 

In [None]:
import pandas as pd 
import urllib.request

pd.set_option('display.max_rows', None) # Don't hide any rows 

urllib.request.urlretrieve(r"https://raw.githubusercontent.com/thomasnield/anaconda_data_preparation_llm/refs/heads/main/tesla_car_reviews.csv", 'tesla_car_reviews.csv')

df = pd.read_csv('tesla_car_reviews.csv')
df

Let's say we are interested in building a large language model or doing some sentiment analysis on these reviews. Let's bring our attention to the `Review_Title` and `Review` columns. We will concatenate the two together but treat each pair as a separate document. Let's do that and print each one. 

In [None]:
titles = df["Review_Title"].to_list()
reviews = df["Review"].to_list()
docs = [f"{t}. {r}\n" for t,r in zip(titles, reviews)]

for doc in docs: 
    print(doc)

## Cleaning and Tokenizing the Data 

Study the output for one moment. What do you notice? Here are some of my observations: 

* There are emojis in the data.
* This is user data, so there are a lot of typos, grammar errors, slang, and unconventional formatting.
* There are numbers, model names, and other nonstandard tokens.
* There is a lot of messy and improper punctuation, including some we introduced ourselves when joining titles and reviews.
* There is improper and inconsistent use of capitalization.

Let's say we are interested in doing sentiment analysis, where we develop a model that predicts whether a review is positive or negative based on its words. This might mean we remove some stop words, and perhaps stem or lemmatize words. 

Whether we clean the data from scratch, use NLTK, or use spaCy, we will need to treat each document individually. Let's say we settle on using spaCy. We will turn each document into a spaCy document. 

In [None]:
import spacy 
nlp = spacy.load("en_core_web_sm")

spacy_docs = [nlp(doc) for doc in docs]

for spacy_doc in spacy_docs: 
    print(spacy_doc)

Within each document, we can traverse each sentence. Below, we take the first document and iterate over its sentences. 

In [None]:
for sent in spacy_docs[0].sents: 
    print(sent)

We can also iterate over each individual token, which includes punctuation as well as words. 

In [None]:
for token in spacy_docs[0]: 
    print(token)

If we want to filter out punctuation, stop words, and non-alphabetic tokens, we can do so in a list comprehension and join what remains back into a string. We will also make the words lowercase and pull out their lemmas. There are more robust ways to do this task with tokenizers in spaCy, but we will stick with this re-concatenation approach for now. 

In [None]:
for spacy_doc in spacy_docs: 
    cleaned_doc = nlp(' '.join([token.lemma_.lower() for token in spacy_doc if token.is_alpha and not token.is_stop]))
    print(f"{cleaned_doc}\n")

Let's package these cleaned docs in a new list. Let's also keep them as plain strings this time, rather than converting them back to spaCy docs with `nlp()`. While we could keep using spaCy, we will switch to scikit-learn for vectorization. 

In [None]:
cleaned_docs = [' '.join([token.lemma_.lower() for token in spacy_doc if token.is_alpha and not token.is_stop])
                for spacy_doc in spacy_docs]
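
For comparison, here is a rough sketch of what cleaning "from scratch" might look like without spaCy. The stop word set and the sample reviews below are made up for illustration, and there is no lemmatization, so the results will not match the spaCy output exactly.

```python
import re

# Tiny illustrative stop word set; a hypothetical stand-in for spaCy's much larger list
STOP_WORDS = {"a", "an", "the", "is", "it", "my", "i", "and", "to", "of", "this"}

def clean_from_scratch(text):
    """Lowercase, keep runs of ASCII letters only, and drop stop words (no lemmas)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)

# Made-up reviews that mimic the mess in the real data
sample_docs = [
    "Best car EVER!!! 😍. I love my Model 3, it is so quiet.",
    "Meh. The range is 250 miles and the trunk rattles.",
]

cleaned_sample = [clean_from_scratch(d) for d in sample_docs]
for doc in cleaned_sample:
    print(doc)
```

Note that the regular expression splits contractions like "don't" into two pieces, which is one reason a library tokenizer is usually the safer choice.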

We could go a step further and try to correct typos, and there are many Python libraries to assist with this. Spell-checking and correcting can be a tedious process, and you may decide to omit anything that is not a dictionary word rather than go through each candidate correction for a typo. It really depends on the task, but for our case of preparing for sentiment analysis, the cleaning we have done so far should be fine. 
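
To make the trade-off concrete, here is a minimal sketch that uses only the standard library's `difflib`. The `DICTIONARY` list and the `correct_or_drop` helper are hypothetical names for illustration. A real project would use a full word list or a dedicated spell-checking library.

```python
import difflib

# Hypothetical mini-dictionary; in practice this would be a full word list
DICTIONARY = ["battery", "charge", "comfortable", "drive", "love", "range", "quiet"]

def correct_or_drop(tokens, dictionary=DICTIONARY, cutoff=0.8):
    """Keep dictionary words, swap near-misses for the closest match, drop the rest."""
    kept = []
    for token in tokens:
        if token in dictionary:
            kept.append(token)
            continue
        # get_close_matches returns the best candidates at or above the similarity cutoff
        matches = difflib.get_close_matches(token, dictionary, n=1, cutoff=cutoff)
        if matches:
            kept.append(matches[0])
    return kept

result = correct_or_drop(["love", "batery", "comfortble", "xqzv", "range"])
print(result)
```

Here "batery" and "comfortble" are close enough to be corrected, while "xqzv" matches nothing and is dropped. Raising `cutoff` makes the filter stricter, so more tokens get dropped instead of corrected.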

## Vectorizing the Data

### CountVectorizer

Next, let's build a `CountVectorizer` off these cleaned docs. 

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

# tokenize and build the vocabulary
vectorizer.fit(cleaned_docs)

# show vocabulary
print(vectorizer.vocabulary_)

Keep in mind that the vocabulary is limited to the words the reviews contain. Any word in a new document that is not in the vocabulary is simply omitted from its vector. For example, if we create a vector from the review "My Tesla is the best!", the words "my", "is", and "the" are ignored because they are stop words that were removed before fitting. We should only see a "1" for the words "tesla" and "best". 

In [None]:
import numpy as np 
np.set_printoptions(threshold=np.inf) # don't truncate vector outputs 

# encode a new document
vector = vectorizer.transform(["My Tesla is the best!"])

# summarize vector
vector.toarray()
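
To see why out-of-vocabulary words vanish, here is a small pure-Python sketch of the same idea. The four-word `vocabulary` is made up for illustration, and the regular expression mirrors `CountVectorizer`'s default token pattern (words of two or more characters).

```python
import re
from collections import Counter

# Hypothetical vocabulary, as if learned from already-cleaned reviews
vocabulary = {"best": 0, "car": 1, "love": 2, "tesla": 3}

def count_vector(text, vocab):
    """Count each vocabulary word in the text; words outside the vocabulary are skipped."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    counts = Counter(t for t in tokens if t in vocab)
    vec = [0] * len(vocab)
    for word, idx in vocab.items():
        vec[idx] = counts[word]
    return vec

toy_vector = count_vector("My Tesla is the best!", vocabulary)
print(toy_vector)  # [1, 0, 0, 1]
```

Only the positions for "best" and "tesla" are nonzero. The words "my", "is", and "the" have no column to land in, so they leave no trace in the vector.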

### TF-IDF Vectorizer

The interesting thing about using the `TfidfVectorizer` is that we do not necessarily need to omit stop words from the documents. Since this vectorizer scores words by how "surprising", or uncommon, they are across the docs, we can trust that stop words like "the" will be scored low. However, words like "best" will stand out more and be scored higher. 

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

cleaned_docs = [' '.join([token.lemma_.lower() for token in spacy_doc if token.is_alpha])
                for spacy_doc in spacy_docs]

vectorizer = TfidfVectorizer()

# tokenize, build the vocabulary, and learn the IDF weights
vectorizer.fit(cleaned_docs)

# show vocabulary and IDF weights
print(vectorizer.vocabulary_)
print(vectorizer.idf_)
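
The `idf_` values follow scikit-learn's default smoothed formula, `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the term. Here is a minimal sketch of that calculation on a made-up three-document corpus. The `toy_docs` list and the `smoothed_idf` helper are for illustration only.

```python
import math

# Made-up, already-lemmatized corpus (not the Tesla reviews)
toy_docs = [
    "the tesla be the best car",
    "the tesla be quiet",
    "the range be good",
]

def smoothed_idf(term, docs):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    n_docs = len(docs)
    doc_freq = sum(1 for d in docs if term in d.split())
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

idf_the = smoothed_idf("the", toy_docs)      # appears in all 3 docs
idf_tesla = smoothed_idf("tesla", toy_docs)  # appears in 2 docs
idf_best = smoothed_idf("best", toy_docs)    # appears in 1 doc

print(idf_the, idf_tesla, idf_best)
```

The rarer the word, the higher its weight: "the" gets the minimum weight of 1.0, while "best" gets the highest. When we later call `transform`, these weights are multiplied by term counts and each row is scaled to unit length by default, which is why the final scores fall between 0 and 1.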

When we create a vector for the review "My Tesla is the best!", note that we get four words scored higher than zero. The word "best" scored really high at approximately `0.8522`. Other words, such as the stop words and "Tesla", scored low at around `0.3145`. It makes sense that "Tesla" scored low because the reviews are about Tesla, so the name is going to be mentioned frequently. 

In [None]:
# encode a new document
vector = vectorizer.transform(["My Tesla is the best!"])

# summarize vector
vector.toarray()

### Hashing Vectorizer

Finally, we can use a `HashingVectorizer` if we only need a one-way conversion, meaning we never have to map vector positions back to the original words. Since we talked about sentiment analysis being the application and that is a prediction system, this might be appropriate. 

In [None]:
from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(n_features=2**10) # default n_features = 2**20

# HashingVectorizer is stateless, so fit() learns nothing; it is kept for consistency
vectorizer.fit(cleaned_docs)

# encode a new document
vector = vectorizer.transform(["My Tesla is the best!"])

# summarize vector
vector.toarray()
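
To get a feel for what "hashing" means here, this is a simplified sketch of the hashing trick in pure Python. It uses MD5 as a stand-in hash, which is an assumption for illustration: scikit-learn uses a signed MurmurHash3 and normalizes the result by default, so the actual numbers will differ.

```python
import hashlib

N_FEATURES = 2**10  # same bucket count as the cell above

def bucket_index(token, n_features=N_FEATURES):
    """Map a token to a column index using a stable hash (MD5 as a stand-in)."""
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_features

def hashed_counts(text, n_features=N_FEATURES):
    """Build a fixed-length count vector without storing any vocabulary."""
    vec = [0] * n_features
    for token in text.lower().split():
        vec[bucket_index(token, n_features)] += 1
    return vec

hashed = hashed_counts("my tesla is the best")
print(len(hashed), sum(hashed))
```

No vocabulary is ever stored, which is why the conversion is one-way: given a column index, we cannot tell which word produced it. With fewer buckets, different words are also more likely to collide in the same column.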

## Conclusion

We learned to scale up our knowledge and integrate the cleaning, tokenization, and vectorization processes with multiple documents. Of course, we dealt with documents packaged in a pandas DataFrame but you can extend this approach to processing separate files and other data sources. This was still a relatively small dataset as well, but it was sizable enough to get a taste of the data preparation process. 

## Exercise

We just collected and processed customer reviews about Tesla cars. If we were trying to build applications that predict whether each review is favorable or not based on the text, what are some bigger questions we need to consider when it comes to the quality of our data? **HINT: Think about who is creating the data and where the data is coming from.**

### SCROLL DOWN FOR ANSWER
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
v 

A big question we need to think about is the *bias* that exists in our data. Here are some possibilities: 

* Are happy customers or unhappy customers more likely to write reviews?
* Do we need a balanced proportion of happy and unhappy customers so both groups are represented?
* Does the review site tend to attract one type of reviewer more than others?
* Are the reviews verified purchases? Or can they be planted by bots, employees, die-hard fans, and other less-than-sincere sources?

It is very easy to get caught up in the data and processing it, but do not forget to ask where it came from and what created it! 