## # Bag of Words

The **`bag-of-words`** model is a simplifying representation used in **natural language processing** and **information retrieval (IR)**. 

In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. The bag-of-words model has also been used for computer vision.

In [1]:
import nltk

import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

In [2]:
para = """Turn data into actionable business insights.
The AWS Data Science team uses the tools our cloud platform provides to unify data preparation, machine learning, and model deployment. We scale the abilities and resources of our customers by delivering advanced functionality for data visualization, feature engineering, model interpretability, and low-latency deployment. Our culture of data-driven decision making requires advanced sales technologies that are timely, accurate, and actionable.

As part of the AWS Data Science team, you’ll discover and solve real-world problems by analyzing large amounts of business data, defining new metrics and business cases, designing simulations and experiments, creating models, and collaborating with colleagues. You’ll bring with you a strong quantitative background and thrive in an environment that leverages statistics, machine learning, operations research, econometrics, and business analysis. And in return, you’ll have the chance to work on some of the world’s largest and diverse datasets.

Learn more about Amazon’s approach to customer-obsessed science on the Amazon Science website, which features the latest news and research from scientists across the company. It’s where you can find information about the conferences we sponsor, the institutions we collaborate with, our awards program, career opportunities, challenges, and more. For the latest updates, subscribe to the monthly newsletter, and follow Amazon Science on LinkedIn, Twitter, Facebook, Instagram, and YouTube.

Interested in AWS? Start here
We’re always glad to connect with talented people. Tell us a bit about what you want to do and we’ll keep you posted on relevant roles and what we’re building at AWS. """

para

'Turn data into actionable business insights.\nThe AWS Data Science team uses the tools our cloud platform provides to unify data preparation, machine learning, and model deployment. We scale the abilities and resources of our customers by delivering advanced functionality for data visualization, feature engineering, model interpretability, and low-latency deployment. Our culture of data-driven decision making requires advanced sales technologies that are timely, accurate, and actionable.\n\nAs part of the AWS Data Science team, you’ll discover and solve real-world problems by analyzing large amounts of business data, defining new metrics and business cases, designing simulations and experiments, creating models, and collaborating with colleagues. You’ll bring with you a strong quantitative background and thrive in an environment that leverages statistics, machine learning, operations research, econometrics, and business analysis. And in return, you’ll have the chance to work on some o

**NOTE:** A **stop word** can include white space characters, but it cannot include punctuation characters, such as a comma or vertical bar.

In [3]:
## Tokenization

sentences = nltk.sent_tokenize(para)

In [4]:
## Create an object for Lemmatizing

lemma = WordNetLemmatizer()
lemma

<WordNetLemmatizer>

In [5]:
## Step i. CLean the text

corpus = []  # Cleaned sentences

# Lowering the words
for i in range(len(sentences)):
    review = re.sub('[^a-zA-z]', ' ', sentences[i])  # Replace all chars apart from alphabets w white space
    review = review.lower()
    cleaned_words = nltk.word_tokenize(review)
    
    # Removing Stop Words followed by lemmatizing
    cleaned_words = [lemma.lemmatize(i) for i in cleaned_words if cleaned_words not in stopwords.words('english')]
    cleaned_sent = " ".join(cleaned_words)
#     print(cleaned_words)
#     print("--")
    corpus.append(cleaned_sent)

In [6]:
corpus

['turn data into actionable business insight',
 'the aws data science team us the tool our cloud platform provides to unify data preparation machine learning and model deployment',
 'we scale the ability and resource of our customer by delivering advanced functionality for data visualization feature engineering model interpretability and low latency deployment',
 'our culture of data driven decision making requires advanced sale technology that are timely accurate and actionable',
 'a part of the aws data science team you ll discover and solve real world problem by analyzing large amount of business data defining new metric and business case designing simulation and experiment creating model and collaborating with colleague',
 'you ll bring with you a strong quantitative background and thrive in an environment that leverage statistic machine learning operation research econometrics and business analysis',
 'and in return you ll have the chance to work on some of the world s largest and

In [7]:
## Let's compare the original sentences against the cleaned ones

[print(f'{i}\n{j}', "\n--") for i, j in zip(sentences, corpus)]

Turn data into actionable business insights.
turn data into actionable business insight 
--
The AWS Data Science team uses the tools our cloud platform provides to unify data preparation, machine learning, and model deployment.
the aws data science team us the tool our cloud platform provides to unify data preparation machine learning and model deployment 
--
We scale the abilities and resources of our customers by delivering advanced functionality for data visualization, feature engineering, model interpretability, and low-latency deployment.
we scale the ability and resource of our customer by delivering advanced functionality for data visualization feature engineering model interpretability and low latency deployment 
--
Our culture of data-driven decision making requires advanced sales technologies that are timely, accurate, and actionable.
our culture of data driven decision making requires advanced sale technology that are timely accurate and actionable 
--
As part of the AWS Dat

[None, None, None, None, None, None, None, None, None, None, None, None, None]

In [8]:
## Step ii. Create the Bag of Words

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()

In [9]:
X  # Our Bag of Words

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 2, 0]], dtype=int64)

In [10]:
X.shape

(13, 155)