## # TF (Term Frequency) and IDF (Inverse Document Frequency):

**`TF = No. of repitions of a Term in the Sentence/Total no. of Sentences`**

**`IDF = log(No. of Sentences/No. of Sentences containing that t=Term )`**

Using these techniques, our emphasis is on bringing some semantic meaning to the words represented in numerical form.

**Key Intuition**: The key intuition motivating **TF-IDF** is the importance of a term is inversely related to its frequency across documents.

* **`TF`** gives us information on how often a term appears in a document/sentence.


* **`IDF`** gives us information about the **relative rarity** of a term in a collection of documents/sentences.

In [1]:
import nltk

import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

In [2]:
%autosave 30

Autosaving every 30 seconds


In [3]:
para = """Turn data into actionable business insights.
The AWS Data Science team uses the tools our cloud platform provides to unify data preparation, machine learning, and model deployment. We scale the abilities and resources of our customers by delivering advanced functionality for data visualization, feature engineering, model interpretability, and low-latency deployment. Our culture of data-driven decision making requires advanced sales technologies that are timely, accurate, and actionable.

As part of the AWS Data Science team, you’ll discover and solve real-world problems by analyzing large amounts of business data, defining new metrics and business cases, designing simulations and experiments, creating models, and collaborating with colleagues. You’ll bring with you a strong quantitative background and thrive in an environment that leverages statistics, machine learning, operations research, econometrics, and business analysis. And in return, you’ll have the chance to work on some of the world’s largest and diverse datasets.

Learn more about Amazon’s approach to customer-obsessed science on the Amazon Science website, which features the latest news and research from scientists across the company. It’s where you can find information about the conferences we sponsor, the institutions we collaborate with, our awards program, career opportunities, challenges, and more. For the latest updates, subscribe to the monthly newsletter, and follow Amazon Science on LinkedIn, Twitter, Facebook, Instagram, and YouTube.

Interested in AWS? Start here
We’re always glad to connect with talented people. Tell us a bit about what you want to do and we’ll keep you posted on relevant roles and what we’re building at AWS. """

para

'Turn data into actionable business insights.\nThe AWS Data Science team uses the tools our cloud platform provides to unify data preparation, machine learning, and model deployment. We scale the abilities and resources of our customers by delivering advanced functionality for data visualization, feature engineering, model interpretability, and low-latency deployment. Our culture of data-driven decision making requires advanced sales technologies that are timely, accurate, and actionable.\n\nAs part of the AWS Data Science team, you’ll discover and solve real-world problems by analyzing large amounts of business data, defining new metrics and business cases, designing simulations and experiments, creating models, and collaborating with colleagues. You’ll bring with you a strong quantitative background and thrive in an environment that leverages statistics, machine learning, operations research, econometrics, and business analysis. And in return, you’ll have the chance to work on some o

In [4]:
## Tokenization

sentences = nltk.sent_tokenize(para)

In [5]:
## Create an object for Lemmatizing

lemma = WordNetLemmatizer()
lemma

<WordNetLemmatizer>

In [6]:
## Step i. CLean the text

corpus = []  # Cleaned sentences

# Lowering the words
for i in range(len(sentences)):
    review = re.sub('[^a-zA-z]', ' ', sentences[i])  # Replace all chars apart from alphabets w white space
    review = review.lower()
    cleaned_words = nltk.word_tokenize(review)
    
    # Removing Stop Words followed by lemmatizing
    cleaned_words = [lemma.lemmatize(i) for i in cleaned_words if cleaned_words not in stopwords.words('english')]
    cleaned_sent = " ".join(cleaned_words)
#     print(cleaned_words)
#     print("--")
    corpus.append(cleaned_sent)

In [7]:
## Let's compare the original sentences against the cleaned ones

[print(f'{i}\n{j}', "\n--") for i, j in zip(sentences, corpus)]

Turn data into actionable business insights.
turn data into actionable business insight 
--
The AWS Data Science team uses the tools our cloud platform provides to unify data preparation, machine learning, and model deployment.
the aws data science team us the tool our cloud platform provides to unify data preparation machine learning and model deployment 
--
We scale the abilities and resources of our customers by delivering advanced functionality for data visualization, feature engineering, model interpretability, and low-latency deployment.
we scale the ability and resource of our customer by delivering advanced functionality for data visualization feature engineering model interpretability and low latency deployment 
--
Our culture of data-driven decision making requires advanced sales technologies that are timely, accurate, and actionable.
our culture of data driven decision making requires advanced sale technology that are timely accurate and actionable 
--
As part of the AWS Dat

[None, None, None, None, None, None, None, None, None, None, None, None, None]

### # Creating the TF-IDF model:

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

tfidf_vec = TfidfVectorizer()
X = tfidf_vec.fit_transform(corpus)

In [9]:
# Our TF-IDF vectorized features

X = X.toarray()
X = pd.DataFrame(X)
X

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,145,146,147,148,149,150,151,152,153,154
0,0.0,0.0,0.0,0.0,0.396865,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.236662,0.0,0.0,0.0,0.0,0.204088,0.0,0.0,0.0,0.0,...,0.163051,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.269669,0.0,0.232553,0.232553,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.17667,0.0,...,0.0,0.0,0.0,0.0,0.0,0.121719,0.0,0.152354,0.110785,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.227176,...,0.0,0.0,0.0,0.0,0.0,0.156516,0.0,0.0,0.284911,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.27413,0.2364,0.171899,0.0
7,0.0,0.160568,0.0,0.209973,0.0,0.0,0.0,0.362146,0.0,0.0,...,0.0,0.209973,0.0,0.0,0.209973,0.0,0.0,0.0,0.0,0.0
8,0.0,0.17156,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.309132,0.0,0.0,0.224347,0.0,0.154566,0.0,0.0,0.140681,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.220167,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.255306
