 <div>
<img src="https://edlitera-images.s3.amazonaws.com/new_edlitera_logo.png" width="500"/>
</div>

# `Lexical characteristics`

* process of segmenting text into lexical expressions

* converting text into base word representations

**Different types of representations:**
   <br>
   
   * `word level representations`
     <br>
   
   * `sentence level representations`
     <br>
   
   * `document level representations`
     <br>
   
   * `dense representations`
     <br>
   
   * `embeddings`

**`Dense representations` and `embeddings` are more complex topics, so we will cover them later on.**

## `Word level representations`

* after a **`morpheme`**, a word is the most elementary part of some text

* often the first thing we do when processing text is separate it into words
    * there are exceptions (e.g. when dealing with extremely noisy data)

### `Tokenization`

* separating text into elements that hold some meaning

* tokens are most often one of the following:
  <br>
  
  * words
   <br>
  
  * numbers
   <br>
  
  * punctuation marks


* tokens can also be multiple words
    <br>
    
    * e.g. "the one and only"
    * it all depends on how we define what a token is when we tokenize our data
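
A minimal sketch of multi-word tokens, assuming NLTK's `MWETokenizer` is used to merge the phrase after a plain whitespace split. The example sentence and the choice of a space as the separator are illustrative assumptions, not part of the original material.

```python
from nltk.tokenize import MWETokenizer

# Register "the one and only" as a single multi-word expression.
# separator=" " joins the merged words with a space (the default is "_").
mwe_tokenizer = MWETokenizer([("the", "one", "and", "only")], separator=" ")

# MWETokenizer works on an already tokenized sentence; a plain split is enough here.
words = "She is the one and only winner".split()
mwe_tokens = mwe_tokenizer.tokenize(words)

print(mwe_tokens)
```

Whether a phrase like this should be one token or four depends on the task, which is why the definition of a token is a design decision.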

**Example before tokenization:**

"The dog sits on the porch."

**Example after tokenization:**

"The" 

"dog"  

"sits"  

"on" 

"the" 

"porch"  

"."

**Creating tokens:**

<br>

* can be problematic
<br>

    * some languages don't separate words with spaces, and punctuation marks are ambiguous as token boundaries (e.g. the periods in "U.S." are part of the token, not separate marks)
    
<br>

* sometimes text needs to be preprocessed before tokenization
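
To illustrate why token creation is not trivial, here is a small sketch that compares a naive whitespace split with a simple regular expression. The pattern is only one possible choice and is an assumption made for this example.

```python
import re

text = "The dog sits on the porch."

# Splitting on whitespace leaves the period glued to the last word.
naive_tokens = text.split()

# A regular expression can treat words and punctuation marks as separate tokens.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(naive_tokens)
print(regex_tokens)
```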

**`Stop words`**

* tokens that appear most often
    <br>
    
    * e.g. as, of, the, etc.
    
    
* these words typically provide no context, so they are usually removed to improve performance



* a list of commonly removed words is called a **`stop word list`**


* **IMPORTANT:** always convert all of your text to lowercase before trying to remove stopwords
<br>
    
    * **stopwords corpora** are all in lowercase so you might miss some word if you run into the uppercase version of it



**Example before removing stop words:**

"The"  , "dog"  , "sits"  , "on"  , "the" , "porch"  ,  "."

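**Example after removing stopwords:**

"The", "dog", "sits", "porch", "."

* note that the capitalized "The" survives because the text was not lowercased first, which is exactly the pitfall described above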

* we can also remove punctuation marks, special characters and numbers if we want
    <br>
    
    * modify the stop words corpus so that it includes those values, so they get removed when we remove stop words (do this before you remove stopwords)
    
    * leverage the power of **`RegEx`**
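
A minimal sketch of the **`RegEx`** route, assuming plain English text where only the letters a-z should survive. The example sentence is made up for illustration.

```python
import re

text = "The dog sits on porch 42, again!"

# Lowercase, then replace everything that is not a letter or whitespace with a space.
cleaned = re.sub(r"[^a-z\s]", " ", text.lower())

# Splitting on whitespace now yields only words.
cleaned_tokens = cleaned.split()

print(cleaned_tokens)
```

This pattern would also strip accented letters, so it needs adjusting for text that is not plain ASCII.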

# Tokenizing text into words using `NLTK`

* **`NLTK`** uses **regular expressions** to tokenize text

* we use the **`word_tokenize()`** function 

* example sentence that we need to tokenize : **"The dog sits on the porch."**

In [6]:
import nltk

In [7]:
# Import the library from nltk

from nltk.tokenize import word_tokenize

In [8]:
# Create example text data

text = "The dog sits on the porch."

In [151]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\swaheed\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [9]:
# Perform tokenization

tokenized_text = word_tokenize(text)

In [10]:
print(tokenized_text)

['The', 'dog', 'sits', 'on', 'the', 'porch', '.']


**`RegexpTokenizer`**

* a lot more flexible than the standard **`word_tokenize()`** method (e.g. allows us to exclude punctuation marks) 


* allows us to precisely define how we want to tokenize the text


* **however, you must be familiar with regex**
    * check out the bonus <a href="Bonus_material_RegEx.ipynb">Bonus_material_RegEx.ipynb</a> notebook if you need a refresher on Python regex

In [11]:
# Import the tokenizer

from nltk.tokenize import RegexpTokenizer

In [12]:
# Define the tokenizer parameters

# Use a raw string so that \w is passed to the regex engine unchanged
tokenizer = RegexpTokenizer(r"[\w']+")

In [13]:
# Tokenize text data

tokenized_text = tokenizer.tokenize(text)

In [14]:
print(tokenized_text)

['The', 'dog', 'sits', 'on', 'the', 'porch']
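
The flexibility mentioned above can be sketched with two more patterns. Both patterns are illustrative assumptions: the first keeps punctuation marks as separate tokens, and the second uses `gaps=True` so that the pattern describes the separators instead of the tokens.

```python
from nltk.tokenize import RegexpTokenizer

sentence = "The dog sits on the porch."

# Match either a run of word characters or a single punctuation mark,
# so punctuation is kept as separate tokens.
keep_punct = RegexpTokenizer(r"\w+|[^\w\s]")
tokens_with_punct = keep_punct.tokenize(sentence)

# With gaps=True the pattern matches the gaps between tokens (here: whitespace).
split_on_space = RegexpTokenizer(r"\s+", gaps=True)
tokens_by_space = split_on_space.tokenize(sentence)

print(tokens_with_punct)
print(tokens_by_space)
```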


**`ToktokTokenizer`**

* much faster than the default tokenizer


* tested and proven for English, Persian, Russian, Czech, French, German and a few other languages

In [15]:
# Import the brown corpus 
# Import the time module 

from nltk.corpus import brown
import time

In [18]:
nltk.download('brown')

[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\swaheed\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\brown.zip.


True

In [16]:
# Standard tokenization on the Brown corpus (roughly 1 million words)

brown_corpus = brown.raw()

start = time.time()

tokenized_text = word_tokenize(brown_corpus)

end = time.time() - start

In [17]:
print(end)

9.96935486793518


In [1]:
# Import the Toktok tokenizer

from nltk.tokenize import ToktokTokenizer

In [3]:
tok_tok_tokenizer = ToktokTokenizer()
tokenized_text = tok_tok_tokenizer.tokenize("this is an example")

In [19]:
# Toktok tokenization on the Brown corpus (roughly 1 million words)

brown_corpus = brown.raw()

tok_tok_tokenizer = ToktokTokenizer()

start = time.time()

tokenized_text = tok_tok_tokenizer.tokenize(brown_corpus)

end = time.time() - start

In [20]:
print(end)

2.1360223293304443


### Exercise 1

**Tokenize the following sentence while also removing punctuation marks:** 

```
"Do not go where the path may lead, go instead where there is no path and leave a trail."
```
 

### Solution

In [21]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"[\w']+")
text = "Do not go where the path may lead, go instead where there is no path and leave a trail."
tokenized_text = tokenizer.tokenize(text)
print(tokenized_text)


['Do', 'not', 'go', 'where', 'the', 'path', 'may', 'lead', 'go', 'instead', 'where', 'there', 'is', 'no', 'path', 'and', 'leave', 'a', 'trail']



### Exercise 2

**Tokenize the following text using the fastest tokenizer from `NLTK`:**

```
"""
Our knowledge has made us cynical.
Our cleverness, hard and unkind.
We think too much, and feel too little.
More than machinery, we need humanity.
More than cleverness, we need kindness and gentleness.
Without these qualities life will be violent, and all will be lost.
"""
```

### Solution

In [23]:
from nltk.tokenize import ToktokTokenizer

# Toktok was the fastest tokenizer in the benchmark above
tok_tok_tokenizer = ToktokTokenizer()

text = """
Our knowledge has made us cynical.
Our cleverness, hard and unkind.
We think too much, and feel too little.
More than machinery, we need humanity.
More than cleverness, we need kindness and gentleness.
Without these qualities life will be violent, and all will be lost.
"""
# Toktok treats its input as a single sentence, so only the final period is split off
tokenized_text = tok_tok_tokenizer.tokenize(text)
print(tokenized_text)


['Our', 'knowledge', 'has', 'made', 'us', 'cynical.', 'Our', 'cleverness', ',', 'hard', 'and', 'unkind.', 'We', 'think', 'too', 'much', ',', 'and', 'feel', 'too', 'little.', 'More', 'than', 'machinery', ',', 'we', 'need', 'humanity.', 'More', 'than', 'cleverness', ',', 'we', 'need', 'kindness', 'and', 'gentleness.', 'Without', 'these', 'qualities', 'life', 'will', 'be', 'violent', ',', 'and', 'all', 'will', 'be', 'lost', '.']


# Removing stopwords using `NLTK`

* **`NLTK`** includes a corpus of stopwords

* we can use that corpus as is, but we can also add extra words to it if we want to

* **IMPORTANT: remember to lowercase your text, otherwise, some stopwords may be missed!**

* let's demonstrate on the following sentence

**"Our virtues and our failings are inseparable, like force and matter. When they separate, man is no more."**

## Example

In [24]:
# Import the stopwords list
# and word_tokenize

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [41]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\swaheed\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [25]:
# Define stop words

stop_words = stopwords.words("english")

In [26]:
print(len(stop_words))

179


In [27]:
# Define example text data

text = "Our virtues and our failings are inseparable, like force and matter. When they separate, man is no more."

In [28]:
# Tokenize text

tokenized_text = word_tokenize(text.lower())

In [29]:
# Create a for loop that removes stopwords
# and store the result in a new list

text_stopwords_removed = []

for word in tokenized_text:
    if word not in stop_words:
        text_stopwords_removed.append(word)

In [30]:
print(text_stopwords_removed)

['virtues', 'failings', 'inseparable', ',', 'like', 'force', 'matter', '.', 'separate', ',', 'man', '.']


**Better solution:**

* instead of using loops, make your code reusable - use functions

In [31]:
# Create a function that removes stopwords
# and also lets users pick the stop words corpus

def remove_stopwords(tokens, language):
    # A set makes each membership check fast, instead of scanning a list
    stop_words = set(stopwords.words(language))
    return [w for w in tokens if w.lower() not in stop_words]

In [32]:
print(remove_stopwords(tokenized_text, language="english"))

['virtues', 'failings', 'inseparable', ',', 'like', 'force', 'matter', '.', 'separate', ',', 'man', '.']


**If we want to add something to the corpus (e.g. punctuation) we just need to add it to the list of stopwords**

In [33]:
# Add to the stopwords list

stop_words.extend([".", ","])

In [34]:
# Remove stopwords

text_stopwords_removed = []

for word in tokenized_text:
    if word not in stop_words:
        text_stopwords_removed.append(word)

In [35]:
print(text_stopwords_removed)

['virtues', 'failings', 'inseparable', 'like', 'force', 'matter', 'separate', 'man']


**NOTE:** Sometimes you might want to combine different corpora of stopwords together. If you do that, be sure to increase the efficiency of your code by getting rid of the duplicates (use a set instead of a list for removing stopwords).
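
The note above can be sketched as follows. The two short lists are hypothetical stand-ins for real stop word corpora (for example the output of `stopwords.words("english")` plus a custom list).

```python
# Stand-ins for two stop word corpora (hypothetical, shortened lists).
corpus_a = ["the", "on", "and", "a"]
corpus_b = ["the", "and", "of", ".", ","]

# The set union drops duplicates and gives fast membership checks.
combined_stop_words = set(corpus_a) | set(corpus_b)

tokens = ["the", "dog", "sits", "on", "the", "porch", "."]
filtered = [t for t in tokens if t not in combined_stop_words]

print(len(combined_stop_words))
print(filtered)
```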

### Exercise 3

**Create a function that removes all stop words and punctuation marks from a sentence. Use it on the following sentence:**

```
"If you set your goals ridiculously high and it's a failure, you will fail above everyone else's success."
```

### Solution

In [36]:
import string

def remove_s(text):
    # Lowercase first: the stop word corpus is all lowercase
    tokenized_text = tok_tok_tokenizer.tokenize(text.lower())
    # Build the filter locally instead of mutating the global stop_words list
    to_remove = set(stopwords.words("english")) | set(string.punctuation)
    return [w for w in tokenized_text if w not in to_remove]

In [37]:
text_is = "If you set your goals ridiculously high and it's a failure, you will fail above everyone else's success."
remove_s(text_is)

['set',
 'goals',
 'ridiculously',
 'high',
 'failure',
 'fail',
 'everyone',
 'else',
 'success']

# `Sentence level representations`

* before analyzing each sentence separately, we often need to separate a large chunk of text into sentences

* similar procedure to how we separate words inside a sentence

* let's tokenize the following text into sentences

### Example

**"Machine Learning can usually be divided into classic Machine Learning and Deep Learning. Classic Machine Learning is easier to understand. Deep Learning on the other hand is a bit more complex."**

* the result we will get by tokenizing it into sentences will look like this
    
    

**['Machine Learning can usually be divided into classic Machine Learning and Deep Learning.',
'Classic Machine Learning is easier to understand.',
'Deep Learning on the other hand is a bit more complex.']**
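
To show the idea, here is a deliberately naive sentence splitter built on a regular expression. It is a sketch under the assumption that every `.`, `!` or `?` followed by whitespace ends a sentence, which is not how NLTK's sentence tokenizer works; the `naive_sent_split` helper is hypothetical.

```python
import re

def naive_sent_split(text):
    # Split after ., ! or ? whenever whitespace follows.
    return re.split(r"(?<=[.!?])\s+", text.strip())

text = (
    "Machine Learning can usually be divided into classic Machine Learning and Deep Learning. "
    "Classic Machine Learning is easier to understand. "
    "Deep Learning on the other hand is a bit more complex."
)
sentences = naive_sent_split(text)

# Abbreviations break the naive rule: "Dr." is treated as a sentence of its own.
abbreviation_case = naive_sent_split("Dr. Smith arrived. He sat down.")

print(sentences)
print(abbreviation_case)
```

The second result shows why a dedicated sentence tokenizer is usually preferred over a hand-written rule.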

## Separating sentences into `n-grams`

* grasping grammar or context from single tokens (especially when they are just words) is very hard

* that is a very common problem when working with **`bag-of-words`** models with **`unigrams` (one phrase = one token)**

* solution: **work with phrases that contain multiple tokens (`n-grams`)**
    <br>
    
    * **NOTE:** using n-grams where n is a large number is rare

**Example of a `bigram` (one phrase = two tokens):**

* "metal boat" is not the same as "boat metal"


* looking at a bigram allows us to capture that distinction

* `n` can be any size, but very large **`n-grams`** are not recommended: they quickly approach whole sentences, and most of them then tend to occur only once in the data
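
The "metal boat" example can be sketched in plain Python. The `make_ngrams` helper is a hypothetical name introduced for this illustration.

```python
def make_ngrams(tokens, n):
    # Slide a window of size n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

phrase_a = ["metal", "boat"]
phrase_b = ["boat", "metal"]

# As unordered unigrams the two phrases are indistinguishable.
same_unigrams = sorted(phrase_a) == sorted(phrase_b)

# As bigrams the word order is preserved, so they differ.
same_bigrams = make_ngrams(phrase_a, 2) == make_ngrams(phrase_b, 2)

print(same_unigrams, same_bigrams)
```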

## Tokenizing documents into sentences using `NLTK`

* we use the **`sent_tokenize()`** function

**Let's use the following text for the purposes of demonstration.**

"Machine Learning can usually be divided into classic Machine Learning and Deep Learning. Classic Machine Learning is relatively easy to understand. Deep Learning on the other hand is a bit more complex."

In [38]:
# Import the sent_tokenize method

from nltk.tokenize import sent_tokenize

In [39]:
# Define example text data

text = """Machine Learning can usually be divided into classic Machine Learning and Deep Learning. 
Classic Machine Learning is relatively easy to understand. Deep Learning on the other hand is a bit more complex."""

In [40]:
# Tokenize text data

tokenized_text = sent_tokenize(text)

In [41]:
print(tokenized_text)

['Machine Learning can usually be divided into classic Machine Learning and Deep Learning.', 'Classic Machine Learning is relatively easy to understand.', 'Deep Learning on the other hand is a bit more complex.']


## Separating sentences into `n-grams` in NLTK

* to separate sentences into `n-grams` in NLTK, we use the `ngrams()` function

In [42]:
# Import the ngrams method

from nltk import ngrams

In [43]:
# Create an example sentence
# The sentence must be tokenized

example_sentence = ["Deep", "Learning", "on", "the", "other", "hand", "is", "a", "bit", "more", "complex"]

In [44]:
# Define n

n = 3

# Separate sentence into trigrams

trigrams = ngrams(example_sentence, n)

In [45]:
type(trigrams)

zip

In [46]:
# Store trigrams inside a list

trigram_list = []

for phrase in trigrams:
    trigram_list.append(phrase)
    

In [47]:
# Take a look at the list of trigrams

trigram_list

[('Deep', 'Learning', 'on'),
 ('Learning', 'on', 'the'),
 ('on', 'the', 'other'),
 ('the', 'other', 'hand'),
 ('other', 'hand', 'is'),
 ('hand', 'is', 'a'),
 ('is', 'a', 'bit'),
 ('a', 'bit', 'more'),
 ('bit', 'more', 'complex')]

# `Document level representations`

* a document can be any text file (log file, etc.)

* we can use sparse or dense representations for the text in a document

* we can represent a document (or a set of documents) as a **`document-term matrix`**
    <br>
    
    * **columns = tokens**
    <br>
    
    * **rows = documents in the set of documents**

**Two common representations (ways of creating the document-term matrix):**
    <br>
    
   * `Bag-of-words`
   <br>
    
   * `TFIDF`

## `Bag-of-words`

<img src="https://edlitera-images.s3.amazonaws.com/bag_of_words_example_image.png" width="500"/>


* text is represented as a **list of counts of its unique words**

* converting a set of documents into a **`document-term matrix`** where each element is the number of times a token (column) occurs in a document (row)
    * this procedure is known as **`count vectorization`**

* used for feature generation

* very important to include a **stop word filter** as preprocessing
    * since words are weighted based on how many times they appear, stop words would otherwise gain too much importance 

* one very big problem with this method: sometimes even the words that appear rarely in a document can be important
    * we will address this in a bit
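
Count vectorization can be sketched by hand with the standard library. The two example documents are assumptions chosen so that one word appears more than once.

```python
from collections import Counter

documents = ["the dog sits", "the dog chased the cat"]

# Tokenize each document with a plain whitespace split.
tokenized = [doc.split() for doc in documents]

# The vocabulary (the columns of the matrix) is the sorted set of unique tokens.
vocabulary = sorted({token for doc in tokenized for token in doc})

# One row per document, one count per vocabulary entry.
matrix = []
for doc in tokenized:
    counts = Counter(doc)
    matrix.append([counts[token] for token in vocabulary])

print(vocabulary)
print(matrix)
```

The stop word "the" gets the highest count in the second row, which is the problem a stop word filter is meant to prevent.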

# `Bag-of-words` using `Scikit-Learn`

* typically **count-vectorization** can be done using the **`Scikit-Learn`** library

* we will take a look at an example now, but we will leave explaining the **`Scikit-Learn`** library in detail for later

* `pandas` (https://pandas.pydata.org/) is an excellent package for doing data processing in Python


* we teach an entire course on it, if you're interested in more in-depth Python data processing!


* a pandas `DataFrame` is a container for storing tabular data
    * it's like an Excel sheet, with rows and columns

### Example

In [48]:
# Import needed libraries

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

In [49]:
# Define example text data

texts = ["good movie", "bad movie", "did not like it", "good one"]

In [50]:
# Define vectorizer

count = CountVectorizer()

In [51]:
# Apply vectorizer to data

features = count.fit_transform(texts)

features

<4x8 sparse matrix of type '<class 'numpy.int64'>'
	with 10 stored elements in Compressed Sparse Row format>

In [52]:
# Convert a sparse matrix into a dense matrix

dense_features = features.todense()

dense_features

matrix([[0, 0, 1, 0, 0, 1, 0, 0],
        [1, 0, 0, 0, 0, 1, 0, 0],
        [0, 1, 0, 1, 1, 0, 1, 0],
        [0, 0, 1, 0, 0, 0, 0, 1]], dtype=int64)

In [53]:
# Get column names
# Columns are our features

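# get_feature_names() was removed in newer scikit-learn releases;
# get_feature_names_out() is its replacement
column_names = list(count.get_feature_names_out())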

column_names

['bad', 'did', 'good', 'it', 'like', 'movie', 'not', 'one']

In [54]:
# Create Pandas dataframe to show result

pd.DataFrame(
    features.todense(), 
    columns=count.get_feature_names_out(), 
    index=texts
)

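| | bad | did | good | it | like | movie | not | one |
|---|---|---|---|---|---|---|---|---|
| good movie | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| bad movie | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| did not like it | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 |
| good one | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |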


### Exercise 4

**Create a `Bag-of-words` representation of the following two sentences using a tokenizer from `NLTK`:**

```
string_1 = "Let us demonstrate the bag of words technique."

string_2 = "And the TFIDF technique."
```

### Solution

In [55]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import pandas as pd

english_stop_words = set(stopwords.words("english"))

def nltk_tokenizer(text):
    # Tokenize with NLTK, then drop stop words and punctuation
    return [w for w in word_tokenize(text) if w.isalpha() and w not in english_stop_words]

def bag_of_words(documents):
    # Each string is one document (one row); the vocabulary is shared by all documents
    # CountVectorizer lowercases the text before it calls the tokenizer
    count = CountVectorizer(tokenizer=nltk_tokenizer, token_pattern=None)
    features = count.fit_transform(documents)
    return pd.DataFrame(
        features.toarray(),
        columns=count.get_feature_names_out(),
        index=documents
    )

string_1 = "Let us demonstrate the bag of words technique."
string_2 = "And the TFIDF technique."

bag_of_words([string_1, string_2])

| | bag | demonstrate | let | technique | tfidf | us | words |
|---|---|---|---|---|---|---|---|
| Let us demonstrate the bag of words technique. | 1 | 1 | 1 | 1 | 0 | 1 | 1 |
| And the TFIDF technique. | 0 | 0 | 0 | 1 | 1 | 0 | 0 |


## `TFIDF`

<img src="https://edlitera-images.s3.amazonaws.com/TFIDF_example_image.png" width="500"/>

* solves the main problem of the **`Bag-of-words`** method

* rare words are given greater weight by multiplying the **`term frequency (TF)`** with the **`inverse document frequency (IDF)`** 

* **`term frequency (TF)`** - how many times a word appears in some document

    <br>
    
    * the **`TF`** factor increases the **`TFIDF`** value of a token proportionally to the number of times the token occurs in that document

* **`inverse document frequency (IDF)`** - the inverse of **`document frequency`**
    <br>
    
    * without going into math, the idea behind **`IDF`** is that the terms that appear more frequently in a collection of documents are less informative than those that appear less often
    <br>
    
    * this part reduces the **`TFIDF`** value of tokens that appear in many documents

**This is one of the most popular weighting methods for creating document-term matrices.**

* usually performed using the **`Scikit-Learn`** library (https://scikit-learn.org/stable/)
    * we will take a look at an example now, but we will go over the **`Scikit-Learn`** library a bit later
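
The TF and IDF factors described above can be worked through by hand. This is a sketch, assuming the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` and the L2 row normalization that `TfidfVectorizer` applies by default; other TFIDF variants use slightly different formulas. The four texts are the ones used in the example that follows.

```python
import math

texts = ["good movie", "bad movie", "did not like it", "good one"]
docs = [t.split() for t in texts]
n_docs = len(docs)
vocab = sorted({w for d in docs for w in d})

# Document frequency: in how many documents each word appears.
doc_freq = {w: sum(w in d for d in docs) for w in vocab}

# Smoothed IDF (assumed variant): rarer words get a larger value.
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}

def tfidf_row(doc):
    # TF (raw count) times IDF, then L2-normalize the row.
    raw = [doc.count(w) * idf[w] for w in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

bad_movie = dict(zip(vocab, tfidf_row(docs[1])))
print(round(bad_movie["bad"], 6), round(bad_movie["movie"], 6))
```

In "bad movie", the word "bad" appears in one document and "movie" in two, so "bad" ends up with the larger weight.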

### Example:

In [56]:
# Import needed libraries

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

In [57]:
# Define example text data

texts = ["good movie", "bad movie", "did not like it", "good one"]

In [58]:
# Define vectorizer

tfidf = TfidfVectorizer()

In [59]:
# Apply vectorizer to data

features = tfidf.fit_transform(texts)

In [60]:
# Create Pandas dataframe to demonstrate result

pd.DataFrame(
    features.todense(), 
    columns=tfidf.get_feature_names_out(),
    index=texts
)

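| | bad | did | good | it | like | movie | not | one |
|---|---|---|---|---|---|---|---|---|
| good movie | 0.0 | 0.0 | 0.707107 | 0.0 | 0.0 | 0.707107 | 0.0 | 0.0 |
| bad movie | 0.785288 | 0.0 | 0.0 | 0.0 | 0.0 | 0.61913 | 0.0 | 0.0 |
| did not like it | 0.0 | 0.5 | 0.0 | 0.5 | 0.5 | 0.0 | 0.5 | 0.0 |
| good one | 0.0 | 0.0 | 0.61913 | 0.0 | 0.0 | 0.0 | 0.0 | 0.785288 |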


### Exercise 5

**Create a `TFIDF` representation of the first two sentences of the following string using a tokenizer from `NLTK`:**

```
text = "I drive a red car. Bill drives a blue car. Jack drives a yellow car. "

```

### Solution

In [61]:
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

def gen_token(text):
    # Split the text into sentences with NLTK and keep only the first two
    sentences = sent_tokenize(text)[:2]

    # Each sentence is treated as one document; the vectorizer lowercases by default
    tfidf = TfidfVectorizer()
    features = tfidf.fit_transform(sentences)

    return pd.DataFrame(
        features.toarray(),
        columns=tfidf.get_feature_names_out(),
        index=sentences
    )

text = "I drive a red car. Bill drives a blue car. Jack drives a yellow car. "
gen_token(text)

| | bill | blue | car | drive | drives | red |
|---|---|---|---|---|---|---|
| I drive a red car. | 0.0 | 0.0 | 0.449436 | 0.631667 | 0.0 | 0.631667 |
| Bill drives a blue car. | 0.534046 | 0.534046 | 0.379978 | 0.0 | 0.534046 | 0.0 |


# `Contractions`

* very important when working with text in English
    * text often contains contractions, and most Python libraries won't know how to deal with them


* using something like **`RegEx`** allows us to deal with that problem

* in real-world applications, you will often know in advance whether your text contains contractions, and which ones to expect

* other languages have similar quirks of their own, so it is a good idea to include some **`RegEx`** in preprocessing to handle the parts of text you know your model will struggle with

* if you need a refresher on **`RegEx`**, take a look at the **`RegEx`** notebook in the bonus material

## Example

In [62]:
# Import needed libraries

import re

# Create function that removes contractions

def decontracted(phrase):
    # Specific cases first, so the generic "n't" rule does not mangle them
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can't", "can not", phrase)
    # Generic suffix rules
    phrase = re.sub(r"n't", " not", phrase)
    phrase = re.sub(r"'re", " are", phrase)
    # Caution: this also rewrites possessives ("dog's" becomes "dog is")
    phrase = re.sub(r"'s", " is", phrase)
    phrase = re.sub(r"'d", " would", phrase)
    phrase = re.sub(r"'ll", " will", phrase)
    phrase = re.sub(r"'t", " not", phrase)
    phrase = re.sub(r"'ve", " have", phrase)
    phrase = re.sub(r"'m", " am", phrase)
    # "y'all" becomes "you all": the leading "y" is kept and "'all" turns into "ou all"
    phrase = re.sub(r"'all", "ou all", phrase)
    return phrase

In [63]:
# Create example text data

text = "I'm finding it hard to believe that wasn't her real name. I simply can't believe it."

In [64]:
# Display result

decontracted(text)

'I am finding it hard to believe that was not her real name. I simply can not believe it.'
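
An alternative to a chain of substitutions is a lookup table with a single compiled pattern. This is a sketch, not a complete solution: the `CONTRACTIONS` table and the `expand_contractions` name are hypothetical, and the function lowercases the text so that the keys match.

```python
import re

# Hypothetical lookup table; extend it with the contractions you expect.
CONTRACTIONS = {
    "won't": "will not",
    "can't": "can not",
    "wasn't": "was not",
    "i'm": "i am",
}

# One pattern that matches any key as a whole word.
contraction_pattern = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b"
)

def expand_contractions(text):
    # Lowercase first so the lookup keys match, then replace whole words only.
    return contraction_pattern.sub(lambda m: CONTRACTIONS[m.group(0)], text.lower())

expanded = expand_contractions("I'm sure it wasn't her. I can't believe it.")
print(expanded)
```

Because only listed words are replaced, possessives such as "dog's" are left alone, at the cost of having to maintain the table.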

### Exercise 6

**Create a function that will get rid of the following contractions:**
    
* Y'all
* I'm
* didn't

**Apply that function to the following sentence and print the result:** 

```
"Y'all should know I'm quite sure you didn't complete the project yet."
```

### Solution

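In [65]:
# decontracted() already covers all three contractions (Y'all, I'm, didn't)
text = "Y'all should know I'm quite sure you didn't complete the project yet."
print(decontracted(text))

You all should know I am quite sure you did not complete the project yet.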

# `Lexical Characteristics Cheat Sheet`

* converting text into base word representations

**Different types of representations:**
   <br>
   
   * `word level representations`
     <br>
   
   * `sentence level representations`
     <br>
   
   * `document level representations`
     <br>
   
   * `dense representations`
     <br>
   
   * `embeddings`

### `Word level representations`

**`Word level tokenization`**

* separating text into elements that hold some meaning (words when talking about word level representations)


* be careful: 
    * remove stop words, punctuation and special symbols as part of tokenization (stop words are matched token by token, so lowercase and tokenize the text first)
    * simple procedure: load a corpus of stopwords and remove every token that is present in your stop words corpus


* performed using tokenizers
    * in NLTK we typically use the **`default tokenizer`**, the **`TokTok Tokenizer`** and the **`Regexp Tokenizer`**

### `Sentence level representations`

**`Sentence level tokenization`**

* splitting larger text into smaller components (e.g. sentences)


* be careful: 
    * do not strip punctuation before this step: the sentence tokenizer relies on it to find sentence boundaries
    * remove stop words, punctuation and special symbols afterwards, when you tokenize each sentence into words


* performed using the **`sent_tokenize()`** method from **`NLTK`**

### `Document level representations`

* we can represent some document (or a set of documents) as a **`document-term matrix`**
    <br>
    
    * **columns = tokens**
    <br>
    
    * **rows = documents in the set of documents**

**Two common representations (ways of creating the document-term matrix):**
    <br>
    
   * `Bag-of-words`
   <br>
    
   * `TFIDF`

**`Bag-of-words`**

* converting a set of documents into a **`document-term matrix`** where each element is the number of times a token occurs in a document


* not very good because it ignores the importance of rare words


* implementation - **`CountVectorizer`** from **`Scikit-learn`**

**`TFIDF`**

* gives rare words greater weight by multiplying the **`term frequency (TF)`** of a word in a document with its **`inverse document frequency (IDF)`** 


* **`TF`** - how many times a word appears in some document


* **`IDF`** - the terms that appear more frequently in a collection of documents are less informative than those that appear less often


* implementation - **`TfidfVectorizer`** from **`Scikit-learn`**

**Bonus - `Contractions`**

* removing contractions is important for English
    * do this before tokenizing your data


* if you want to get rid of contractions:
    * create a function using **`RegEx`**


