## <div align="center"> Text processing pipeline </div>
Clean and prepare raw text for classification and other NLP tasks.

<hr/>

### Use cases
- Sentiment analysis
- Text summarization
- Machine translation

<img src="https://github.com/sondosaabed/Data-Analyst-Nanodegree/assets/65151701/df752feb-6081-4318-a3a5-125a1d4c68d4" height="600"/>

<hr/>

## <div align ="center">  Pipeline of handling text data sets </div>

<div align ="center"> 
    <img src="https://github.com/sondosaabed/Data-Analyst-Nanodegree/assets/65151701/b7baeb0c-92dc-4cd1-b99a-8dbe9bb7f9d0" height="200">
</div>

<hr/>

## <div align ="center"> Tools </div>

<div align ="center">
    <img src="https://github.com/sondosaabed/Data-Analyst-Nanodegree/assets/65151701/c15b4b04-fc4c-4d09-97e9-1ff0d1697186" height="200"/>
</div>

<hr/>

### Example Dataset: Ham or Spam

In [1]:
import warnings
import pandas as pd
warnings.filterwarnings("ignore")

- Reading the data from the source file: 

In [2]:
df = pd.read_csv("./email_spam.csv")
df.head()

Unnamed: 0,title,text,type
0,?? the secrets to SUCCESS,"Hi James,\r\n\r\nHave you claim your complimen...",spam
1,?? You Earned 500 GCLoot Points,"\r\nalt_text\r\nCongratulations, you just earn...",not spam
2,?? Your GitHub launch code,"Here's your GitHub launch code, @Mortyj420!\r\...",not spam
3,[The Virtual Reward Center] Re: ** Clarifications,"Hello,\r\n \r\nThank you for contacting the Vi...",not spam
4,"10-1 MLB Expert Inside, Plus Everything You Ne...","Hey Prachanda Rawal,\r\n\r\nToday's newsletter...",spam


- Here is a sample of the emails

"Hi Walid,

Do you listen to music on Spotify, YouTube, Amazon or Apple?

If you do - you qualify!

You could be making $50 for every song you stream...

All it takes is 3 steps...

Step 1: Create Your Account
Create your account here

Step 2: Pick Your Favourite Artist
Select from thousands of artists and vibe to the music

Step 3: Get Paid
That's it, for every song you stream...

=> Click here right now to start instantly

Regards,

Alex

---
?? Connect with us on Telegram: https://t.me/moneymakingcentral"

## <div align ="center"> Preprocessing techniques </div>

<div align ="center">
    <img src="https://github.com/sondosaabed/Data-Analyst-Nanodegree/assets/65151701/47bf60bf-3078-47c1-bb58-2f72b9c9a9f2" height="200">
</div>

- Tokenization
- Stop word removal
- Stemming
- Rare word removal


### Motivation
- Reduce features
- Cleaner, more representative datasets
- **Improving data quality:** removing noise and irrelevant information ensures that the data fed into the model is clean and consistent.
<hr/>

### 1. **Tokenization**
- Tokens or words are extracted from text
- Tokenization using torchtext.

In [3]:
from torchtext.data.utils import get_tokenizer

tokenizer = get_tokenizer("basic_english")
tokens = tokenizer("I am reading a book now. I love to read books!")

print(tokens)

['i', 'am', 'reading', 'a', 'book', 'now', '.', 'i', 'love', 'to', 'read', 'books', '!']
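
For intuition about what the tokenizer does, the output above can be roughly approximated with a regular expression that lowercases the text and splits punctuation into separate tokens. This is only a sketch: `simple_tokenize` is a hypothetical helper, and torchtext's `basic_english` applies its own normalization rules that this pattern does not replicate.

```python
import re

def simple_tokenize(text):
    # Lowercase, then capture runs of word characters or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text.lower())

print(simple_tokenize("I am reading a book now. I love to read books!"))
```

On this sentence the sketch yields the same tokens as the cell above; on text with URLs, apostrophes, or hyphens the two will likely differ.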


In [4]:
df['text_tokens'] = df['text'].apply(tokenizer)
df.head()

Unnamed: 0,title,text,type,text_tokens
0,?? the secrets to SUCCESS,"Hi James,\r\n\r\nHave you claim your complimen...",spam,"[hi, james, ,, have, you, claim, your, complim..."
1,?? You Earned 500 GCLoot Points,"\r\nalt_text\r\nCongratulations, you just earn...",not spam,"[alt_text, congratulations, ,, you, just, earn..."
2,?? Your GitHub launch code,"Here's your GitHub launch code, @Mortyj420!\r\...",not spam,"[here, ', s, your, github, launch, code, ,, @m..."
3,[The Virtual Reward Center] Re: ** Clarifications,"Hello,\r\n \r\nThank you for contacting the Vi...",not spam,"[hello, ,, thank, you, for, contacting, the, v..."
4,"10-1 MLB Expert Inside, Plus Everything You Ne...","Hey Prachanda Rawal,\r\n\r\nToday's newsletter...",spam,"[hey, prachanda, rawal, ,, today, ', s, newsle..."


In [5]:
df['text_tokens'].sample().item()

['hello',
 'sathya',
 'narayanan',
 ',',
 'our',
 'team',
 'reply',
 'to',
 'your',
 'ticket',
 'from',
 '2023-06-17',
 '07',
 '05',
 '04',
 '.',
 'to',
 'see',
 'the',
 'reply',
 'please',
 'click',
 'here',
 'https',
 '//offers',
 '.',
 'cpx-research',
 '.',
 'com/ticket',
 '.',
 'php',
 '?',
 'tid=238212&hash=shmykfq0usb8lluap635y67lr1yigcsiqwn8but5gcoznqn69qquabrgpbmrcblui34exlazilz6ks4ebbxpdyyoiznxnhiprslu2wctvp8acvaqayvm0ul0yccaga3qjjw7rbrmhqyu0upl2xdmq',
 'your',
 'cpx',
 'research',
 'customer',
 'happiness',
 'team',
 'ps',
 'always',
 'read',
 'and',
 'answer',
 'surveys',
 'careful',
 ',',
 'there',
 'might',
 'be',
 'hidden',
 'test',
 'questions',
 'checking',
 'if',
 'you',
 'pay',
 'attention',
 '.',
 'also',
 'if',
 'youre',
 'replying',
 'too',
 'fast',
 ',',
 'some',
 'partners',
 'will',
 'not',
 'pay',
 'your',
 'reward',
 '.',
 'tickethash=shmykfq0usb8lluap635y67lr1yigcsiqwn8but5gcoznqn69qquabrgpbmrcblui34exlazilz6ks4ebbxpdyyoiznxnhiprslu2wctvp8acvaqayvm0ul0yccaga3

<hr/>

### **2. Stop word removal**
- Eliminate common words that do not contribute to the meaning
- Stop words: "a", "the", "and", "or", and more

In [6]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
filtered_tokens = [token for token in tokens if token.lower() not in stop_words]

print(filtered_tokens)

['reading', 'book', '.', 'love', 'read', 'books', '!']


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\SS\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [7]:
def remove_stopwords(tokens):
    return [token for token in tokens if token.lower() not in stop_words]

In [8]:
df['remove_stopwords'] = df['text_tokens'].apply(remove_stopwords)
df.head()

Unnamed: 0,title,text,type,text_tokens,remove_stopwords
0,?? the secrets to SUCCESS,"Hi James,\r\n\r\nHave you claim your complimen...",spam,"[hi, james, ,, have, you, claim, your, complim...","[hi, james, ,, claim, complimentary, gift, yet..."
1,?? You Earned 500 GCLoot Points,"\r\nalt_text\r\nCongratulations, you just earn...",not spam,"[alt_text, congratulations, ,, you, just, earn...","[alt_text, congratulations, ,, earned, 500, co..."
2,?? Your GitHub launch code,"Here's your GitHub launch code, @Mortyj420!\r\...",not spam,"[here, ', s, your, github, launch, code, ,, @m...","[', github, launch, code, ,, @mortyj420, !, oc..."
3,[The Virtual Reward Center] Re: ** Clarifications,"Hello,\r\n \r\nThank you for contacting the Vi...",not spam,"[hello, ,, thank, you, for, contacting, the, v...","[hello, ,, thank, contacting, virtual, reward,..."
4,"10-1 MLB Expert Inside, Plus Everything You Ne...","Hey Prachanda Rawal,\r\n\r\nToday's newsletter...",spam,"[hey, prachanda, rawal, ,, today, ', s, newsle...","[hey, prachanda, rawal, ,, today, ', newslette..."


In [9]:
df['remove_stopwords'].sample().item()

['dear',
 'joseph',
 'alex',
 'eze',
 'pleased',
 'inform',
 'part',
 'unicaf’s',
 '10-year',
 'anniversary',
 ',',
 'selected',
 'special',
 'scholarship',
 'allow',
 'study',
 'towards',
 'british',
 'master’s',
 'degree',
 'choice',
 '?',
 '1',
 ',',
 '950',
 '.',
 'excited',
 'join',
 'us',
 'believe',
 'program',
 'provide',
 'skills',
 'knowledge',
 'need',
 'succeed',
 'chosen',
 'field',
 '.',
 '90%',
 'graduates',
 'employment',
 'earned',
 'higher',
 'salary',
 '.',
 'believe',
 'opportunity',
 'allow',
 'pursue',
 'educational',
 'goals',
 'without',
 'financial',
 'burden',
 'accompanies',
 'master',
 "'",
 'degree',
 'programme',
 '.',
 'encourage',
 'take',
 'advantage',
 'offer',
 'join',
 'us',
 'pursuit',
 'knowledge',
 'academic',
 'excellence',
 'replying',
 'email',
 '.',
 'take',
 'advantage',
 'offer',
 'please',
 'quote',
 'adviser',
 'code',
 'ng1950',
 '.',
 'sincerely',
 ',',
 'olusegun',
 'onyinyechi',
 'scholarship',
 'adviser']
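
The sample above still contains punctuation tokens such as `,` and `.`, because they are not in the stop word list. If those are unwanted, one option is to filter them in the same pass. The sketch below is not part of the original notebook: `remove_stopwords_and_punct` is a hypothetical helper, and it uses a tiny hard-coded stop word set so that it runs without NLTK.

```python
import string

# Tiny illustrative stop word set; the notebook uses NLTK's English list instead.
demo_stop_words = {"i", "am", "a", "to", "now"}

def remove_stopwords_and_punct(tokens, stop_words):
    # Drop stop words and any token made up entirely of punctuation characters.
    return [
        token for token in tokens
        if token.lower() not in stop_words
        and not all(ch in string.punctuation for ch in token)
    ]

tokens = ['i', 'am', 'reading', 'a', 'book', 'now', '.', 'i', 'love', 'to', 'read', 'books', '!']
print(remove_stopwords_and_punct(tokens, demo_stop_words))
```

Tokens that mix letters and punctuation, such as `10-year`, are kept because only fully punctuation tokens are dropped.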

<hr/>

### **3. Stemming**
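- Reducing words to a common stem by stripping suffixes; the stem is not always a dictionary word (for example "applic")
- For example: "running" and "runs" both become "run". Irregular forms such as "ran" are left unchanged by the Porter stemmer used here.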

In [10]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
filtered_tokens = ["reading", "book", ".", "love", "read", "books", "!"]
stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]

print(stemmed_tokens)

['read', 'book', '.', 'love', 'read', 'book', '!']


In [11]:
def stemming(filtered_tokens):
    return [stemmer.stem(token) for token in filtered_tokens]

In [12]:
df['steemed_tokens'] = df['remove_stopwords'].apply(stemming)
df.head()

Unnamed: 0,title,text,type,text_tokens,remove_stopwords,steemed_tokens
0,?? the secrets to SUCCESS,"Hi James,\r\n\r\nHave you claim your complimen...",spam,"[hi, james, ,, have, you, claim, your, complim...","[hi, james, ,, claim, complimentary, gift, yet...","[hi, jame, ,, claim, complimentari, gift, yet,..."
1,?? You Earned 500 GCLoot Points,"\r\nalt_text\r\nCongratulations, you just earn...",not spam,"[alt_text, congratulations, ,, you, just, earn...","[alt_text, congratulations, ,, earned, 500, co...","[alt_text, congratul, ,, earn, 500, complet, f..."
2,?? Your GitHub launch code,"Here's your GitHub launch code, @Mortyj420!\r\...",not spam,"[here, ', s, your, github, launch, code, ,, @m...","[', github, launch, code, ,, @mortyj420, !, oc...","[', github, launch, code, ,, @mortyj420, !, oc..."
3,[The Virtual Reward Center] Re: ** Clarifications,"Hello,\r\n \r\nThank you for contacting the Vi...",not spam,"[hello, ,, thank, you, for, contacting, the, v...","[hello, ,, thank, contacting, virtual, reward,...","[hello, ,, thank, contact, virtual, reward, ce..."
4,"10-1 MLB Expert Inside, Plus Everything You Ne...","Hey Prachanda Rawal,\r\n\r\nToday's newsletter...",spam,"[hey, prachanda, rawal, ,, today, ', s, newsle...","[hey, prachanda, rawal, ,, today, ', newslette...","[hey, prachanda, rawal, ,, today, ', newslett,..."


In [13]:
df['steemed_tokens'].sample().item()

['dear',
 'maryam',
 ',',
 'would',
 'like',
 'thank',
 'applic',
 'role',
 'child',
 'protect',
 'emerg',
 'specialist',
 ',',
 'maiduguri',
 '-',
 'nigeria',
 'interest',
 'plan',
 'intern',
 '.',
 'regret',
 'inform',
 'occas',
 'success',
 'applic',
 '.',
 'howev',
 ',',
 'may',
 'posit',
 'suitabl',
 'skill',
 '.',
 'pleas',
 'feel',
 'free',
 'view',
 'current',
 'vacanc']

<hr/>

### **4. Rare word removal**
- Removing infrequent words that don't add value

<img src="https://github.com/sondosaabed/Data-Analyst-Nanodegree/assets/65151701/b5c58500-a539-4042-ba89-4b8db816e359" height="500"/>

In [14]:
from nltk.probability import FreqDist

stemmed_tokens = ["read", "book", ".", "love", "read", "book", "!"]
freq_dist = FreqDist(stemmed_tokens)
threshold = 1

common_tokens = [token for token in stemmed_tokens if freq_dist[token] > threshold]
print(common_tokens)

['read', 'book', 'read', 'book']


In [15]:
def remove_rare(stemmed_tokens):
    freq_dist = FreqDist(stemmed_tokens)
    return [token for token in stemmed_tokens if freq_dist[token] > 1]

In [16]:
df['rare_words_removed'] = df['steemed_tokens'].apply(remove_rare)
df.head()

Unnamed: 0,title,text,type,text_tokens,remove_stopwords,steemed_tokens,rare_words_removed
0,?? the secrets to SUCCESS,"Hi James,\r\n\r\nHave you claim your complimen...",spam,"[hi, james, ,, have, you, claim, your, complim...","[hi, james, ,, claim, complimentary, gift, yet...","[hi, jame, ,, claim, complimentari, gift, yet,...","[,, claim, gift, ?, gift, ?, ., >>, claim, >>,..."
1,?? You Earned 500 GCLoot Points,"\r\nalt_text\r\nCongratulations, you just earn...",not spam,"[alt_text, congratulations, ,, you, just, earn...","[alt_text, congratulations, ,, earned, 500, co...","[alt_text, congratul, ,, earn, 500, complet, f...","[,, earn, point, earn, point, ,, ,, ,, hong, k..."
2,?? Your GitHub launch code,"Here's your GitHub launch code, @Mortyj420!\r\...",not spam,"[here, ', s, your, github, launch, code, ,, @m...","[', github, launch, code, ,, @mortyj420, !, oc...","[', github, launch, code, ,, @mortyj420, !, oc...","[github, code, github, code, github]"
3,[The Virtual Reward Center] Re: ** Clarifications,"Hello,\r\n \r\nThank you for contacting the Vi...",not spam,"[hello, ,, thank, you, for, contacting, the, v...","[hello, ,, thank, contacting, virtual, reward,...","[hello, ,, thank, contact, virtual, reward, ce...","[,, thank, contact, virtual, reward, center, ...."
4,"10-1 MLB Expert Inside, Plus Everything You Ne...","Hey Prachanda Rawal,\r\n\r\nToday's newsletter...",spam,"[hey, prachanda, rawal, ,, today, ', s, newsle...","[hey, prachanda, rawal, ,, today, ', newslette...","[hey, prachanda, rawal, ,, today, ', newslett,...","[,, today, ', day, ,, insid, play, ,, video, p..."


- This is what the final preprocessed text looks like:

In [17]:
processed_text = df['rare_words_removed']
processed_text.sample(10)

44    [notic, login, ,, alexxuzi, notic, login, devi...
38                                            [., ., .]
8              [,, ., ,, turkey, ,, turkey, ,, ., ,, .]
47    [doordash, order, ,, estim, order, 9, pm, 9, p...
36    [., ,, statement, account, (, ), ., pleas, acc...
31    [., ., ., sex, ?, ?, ., need, ., need, sex, ., .]
59    [netflix, ,, ,, find, inform, request, netflix...
17    [pleas, ,, scholarship, allow, degre, ,, ., jo...
68    [,, 2023, ,, make, chang, googl, play, term, s...
61    [,, file, ., today, ,, file, return, ., file, ...
Name: rare_words_removed, dtype: object
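
`remove_rare` counts frequencies inside each email separately, so short emails can be reduced to little more than punctuation, as in several rows above. An alternative is to count across the whole corpus and drop tokens that are rare overall. A minimal sketch, assuming the input is a list of token lists; `remove_rare_corpus` and `min_count` are hypothetical names, not part of the notebook.

```python
from collections import Counter

def remove_rare_corpus(docs, min_count=2):
    # Count every token across all documents, not per document.
    counts = Counter(token for doc in docs for token in doc)
    # Keep tokens seen at least min_count times in the corpus.
    return [[token for token in doc if counts[token] >= min_count] for doc in docs]

docs = [["read", "book", "love"], ["read", "book", "hate"], ["book", "shop"]]
print(remove_rare_corpus(docs))
```

With a corpus-level count, a word that appears once in each of many emails is kept, while a word that appears in only one email is dropped.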

<hr/>

## Preprocessing techniques recap
- Tokenization
- Stop word removal
- Stemming
- Rare word removal
- More techniques exist

<hr/>

## <div align ="center"> Encoding techniques </div>

<div align ="center">
    <img src="https://github.com/sondosaabed/Data-Analyst-Nanodegree/assets/65151701/7bdbe523-7e07-442b-87b9-e35e602d49f5" height="120"/>
</div>

### Motivation
- Convert text into machine-readable numbers
- Enable analysis and modeling

<div align ="center">
    <img src="https://github.com/sondosaabed/Data-Analyst-Nanodegree/assets/65151701/ac62b9af-5c9d-4bc5-a643-3df0ee31c394" height="500"/>
</div>


- Allows models to understand and process text
- Choose one technique to avoid redundancy
- More techniques exist

## Encoding Techniques
- One-hot encoding: transforms words into unique numerical representations
- Bag-of-Words (BoW): captures word frequency, disregarding order
- TF-IDF: weights how often a word appears in a document against how common it is across all documents
<hr/>

### **1. One-hot encoding**
- Mapping each word to a distinct vector

Binary vector:
- 1 for the presence of a word
- 0 for the absence of a word

['cat', 'dog', 'rabbit']

'cat' [1, 0, 0]

'dog' [0, 1, 0]

'rabbit' [0, 0, 1]

In [18]:
import torch

vocab = ['cat', 'dog', 'rabbit']
vocab_size = len(vocab)

one_hot_vectors = torch.eye(vocab_size)
one_hot_dict = {word: one_hot_vectors[i] for i, word in enumerate(vocab)}

print(one_hot_dict)

{'cat': tensor([1., 0., 0.]), 'dog': tensor([0., 1., 0.]), 'rabbit': tensor([0., 0., 1.])}


In [19]:
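def one_hot_encoding(row):
    # Vocabulary for this row only: the unique tokens, sorted for a stable order
    vocab = sorted(set(row))
    vocab_size = len(vocab)
    # Row i of the identity matrix is the one-hot vector of the i-th vocabulary word
    one_hot_vectors = torch.eye(vocab_size)
    # return {word: one_hot_vectors[i] for i, word in enumerate(vocab)}
    return [one_hot_vectors[i] for i, word in enumerate(vocab)]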

In [20]:
df['ohe'] = df['rare_words_removed'].apply(one_hot_encoding)
df['ohe'].head()

0    [[tensor(1.), tensor(0.), tensor(0.), tensor(0...
1    [[tensor(1.), tensor(0.), tensor(0.), tensor(0...
2    [[tensor(1.), tensor(0.)], [tensor(0.), tensor...
3    [[tensor(1.), tensor(0.), tensor(0.), tensor(0...
4    [[tensor(1.), tensor(0.), tensor(0.), tensor(0...
Name: ohe, dtype: object

In [21]:
df['ohe'][1]

[tensor([1., 0., 0., 0., 0.]),
 tensor([0., 1., 0., 0., 0.]),
 tensor([0., 0., 1., 0., 0.]),
 tensor([0., 0., 0., 1., 0.]),
 tensor([0., 0., 0., 0., 1.])]
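
The function above returns one vector per unique word rather than one per token, so the result does not follow the order of the words in the email. To encode a token sequence, each token can be looked up in a fixed vocabulary. A minimal sketch in plain Python; `one_hot_sequence` is a hypothetical helper, not part of the notebook.

```python
def one_hot_sequence(tokens, vocab):
    # Map each vocabulary word to its column index.
    index = {word: i for i, word in enumerate(vocab)}
    vectors = []
    for token in tokens:
        vector = [0] * len(vocab)
        vector[index[token]] = 1  # raises KeyError for out-of-vocabulary tokens
        vectors.append(vector)
    return vectors

vocab = ['cat', 'dog', 'rabbit']
print(one_hot_sequence(['dog', 'cat', 'dog'], vocab))
```

Sharing one vocabulary across all documents keeps the column meaning fixed, at the cost of vectors as long as the vocabulary.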

<hr/>

### **2. Bag of words**
- Example: "The cat sat on the mat"
- Bag-of-words: {'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1}

- Treating each document as an unordered collection of words
- Focuses on frequency, not order
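
The counts in the example above can be reproduced with the standard library. A minimal sketch that lowercases and splits on whitespace, which is simpler than what `CountVectorizer` does:

```python
from collections import Counter

sentence = "The cat sat on the mat"
# Lowercase so "The" and "the" are counted as the same word.
bag = Counter(sentence.lower().split())
print(dict(bag))
```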

In [22]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
corpus = ['This is the first document.', 'This document is the second document.', 'And this is the third one.', 'Is this the first document?']
X = vectorizer.fit_transform(corpus)

print(X.toarray())
print(vectorizer.get_feature_names_out())

[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']


In [32]:
vectorizer = CountVectorizer()

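def encode_count_vector(row):
    # Join the tokens back into one string, since CountVectorizer expects raw text
    row = ' '.join(row)
    # Note: fitting here gives every row its own vocabulary, so columns are not comparable across rows
    return vectorizer.fit_transform([row])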

In [33]:
df['cvector_encoded'] = df['remove_stopwords'].apply(encode_count_vector)
df['cvector_encoded'].head()

0      (0, 13)\t1\n  (0, 14)\t1\n  (0, 2)\t3\n  (0,...
1      (0, 4)\t1\n  (0, 11)\t1\n  (0, 13)\t1\n  (0,...
2      (0, 4)\t3\n  (0, 5)\t1\n  (0, 1)\t2\n  (0, 6...
3      (0, 10)\t1\n  (0, 26)\t2\n  (0, 5)\t1\n  (0,...
4      (0, 174)\t1\n  (0, 261)\t1\n  (0, 270)\t1\n ...
Name: cvector_encoded, dtype: object

In [34]:
df['cvector_encoded'][1].toarray()

array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1,
        1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int64)
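
Note that `encode_count_vector` refits the vectorizer on every row, so each email gets its own vocabulary and column `i` refers to a different word from one row to the next. For vectors that are comparable across documents, the vocabulary has to be built once for the whole corpus. The sketch below shows the idea in plain Python on the four-sentence corpus used earlier; `bag_of_words` is a hypothetical helper, and the token pattern (lowercased words of two or more characters) is an assumption chosen to mirror `CountVectorizer`'s default behavior.

```python
import re
from collections import Counter

def bag_of_words(corpus):
    # Tokenize: lowercase, keep words of two or more characters.
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in corpus]
    # One shared, sorted vocabulary for the whole corpus.
    vocab = sorted({token for tokens in tokenized for token in tokens})
    # One count vector per document, all with the same columns.
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocab])
    return matrix, vocab

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
matrix, vocab = bag_of_words(corpus)
for row in matrix:
    print(row)
print(vocab)
```

With scikit-learn, the equivalent is to call `fit_transform` once on the full list of documents instead of once per row.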

<hr/>

### **3. TF-IDF**
Term Frequency-Inverse Document Frequency
- Scores how important a word is to a document within a corpus
- Words that appear in few documents get a higher score
- Words that appear in most documents get a lower score
- Emphasizes informative words

In [35]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
corpus = ['This is the first document.','This document is the second document.', 'And this is the third one.','Is this the first document?']

X = vectorizer.fit_transform(corpus)
print(X.toarray())
print(vectorizer.get_feature_names_out())

[[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]
 [0.         0.6876236  0.         0.28108867 0.         0.53864762
  0.28108867 0.         0.28108867]
 [0.51184851 0.         0.         0.26710379 0.51184851 0.
  0.26710379 0.51184851 0.26710379]
 [0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]]
['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
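
The scores above can be reproduced by hand. The sketch below assumes scikit-learn's default settings: a smoothed inverse document frequency, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each row. `tokenize` and `tfidf_row` are hypothetical helpers, and the tokenization is a simplified stand-in for the vectorizer's own.

```python
import math
import re
from collections import Counter

def tokenize(doc):
    # Lowercase, keep words of two or more characters.
    return re.findall(r"\b\w\w+\b", doc.lower())

def tfidf_row(doc, corpus):
    n_docs = len(corpus)
    doc_token_sets = [set(tokenize(d)) for d in corpus]
    weights = {}
    for term, tf in Counter(tokenize(doc)).items():
        # Document frequency: number of documents containing the term.
        df = sum(term in tokens for tokens in doc_token_sets)
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        weights[term] = tf * idf
    # L2-normalize so the row has unit length.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
row = tfidf_row(corpus[0], corpus)
for term in sorted(row):
    print(term, round(row[term], 8))
```

For the first document this gives roughly 0.4698 for `document`, 0.5803 for `first`, and 0.3841 for `is`, `the`, and `this`, in line with the first row of the matrix above. `first` scores highest because it occurs in only two of the four documents.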


In [51]:
vectorizer_1 = TfidfVectorizer()

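def tf_idf_transform(row):
    # row is a list of tokens, so each token is treated as its own one-word document;
    # the vectorizer is also refit on every row, so it only remembers the last row's vocabulary
    return vectorizer_1.fit_transform(row).toarray()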

In [52]:
df['tfid_vector_encoded'] = df['remove_stopwords'].apply(tf_idf_transform)
df['tfid_vector_encoded'].head()

0    [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,...
1    [[0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0,...
2    [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,...
3    [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,...
4    [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,...
Name: tfid_vector_encoded, dtype: object

In [53]:
print(df['tfid_vector_encoded'][1])
print(vectorizer_1.get_feature_names_out())

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 1. 0. 0.]]
['90' 'act' 'appear' 'bard' 'click' 'com' 'community' 'company' 'days'
 'earned' 'expire' 'explorers' 'follow' 'fulfillment' 'gift' 'go' 'icon'
 'india' 'instructions' 'links' 'member' 'online' 'page' 'please'
 'profile' 'protection' 'provided' 'questions' 'quickly' 'reach'
 'received' 'redeem' 'see' 'sign' 'support' 'top' 'virtualrewardcenter'
 'visit']


<hr/>

## Encoding techniques recap
- One-hot encoding
- Bag-of-words
- TF-IDF encoding
- More techniques exist

In [None]:
# Import libraries
from torch.utils.data import Dataset, DataLoader

# Create a class
class TextDataset(Dataset):
    def __init__(self, text):
        self.text = text
    def __len__(self):
        return len(self.text)
    def __getitem__(self, idx):
        return self.text[idx]

## Full Text preparation pipeline

In [None]:
def preprocess_sentences(sentences, min_count=1):
    processed_sentences = []

    for sentence in sentences:
        sentence = sentence.lower()
        tokens = tokenizer(sentence)
        tokens = [token for token in tokens if token not in stop_words]
        tokens = [stemmer.stem(token) for token in tokens]
        freq_dist = FreqDist(tokens)
        # Keep tokens that occur at least min_count times in the sentence.
        # A strict "> 2" filter empties short sentences and can leave CountVectorizer with an empty vocabulary.
        tokens = [token for token in tokens if freq_dist[token] >= min_count]
        processed_sentences.append(' '.join(tokens))

    return processed_sentences

In [None]:
def encode_sentences(sentences):
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(sentences)
    encoded_sentences = X.toarray()
    return encoded_sentences, vectorizer

In [None]:
import re
def extract_sentences(data):
    sentences = re.findall(r'[A-Z][^.!?]*[.!?]', data)
    return sentences
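
To see what the pattern matches: it starts at an uppercase ASCII letter and runs to the next `.`, `!`, or `?`. Text that begins with a lowercase letter is skipped until the next capital. A quick check of that behavior, with the function repeated so the snippet runs on its own:

```python
import re

def extract_sentences(data):
    # Each match starts at a capital letter and ends at the first ., ! or ?
    return re.findall(r'[A-Z][^.!?]*[.!?]', data)

first = extract_sentences("This is the first text data. And here is another one.")
second = extract_sentences("lowercase start is skipped. This one is kept!")
print(first)
print(second)
```

Because the pattern stops at the first period, abbreviations and decimal numbers would also be split, so this is best treated as a rough splitter for simple text.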

In [None]:
def text_processing_pipeline(text):
    tokens = preprocess_sentences(text)
    encoded_sentences, vectorizer = encode_sentences(tokens)
    dataset = TextDataset(encoded_sentences)
    dataloader = DataLoader(dataset, batch_size=2, shuffle=True)
    return dataloader, vectorizer

In [None]:
text_data = "This is the first text data. And here is another one."
sentences = extract_sentences(text_data)
# Pass the whole list of sentences; the pipeline preprocesses and encodes them together
dataloader, vectorizer = text_processing_pipeline(sentences)
print(next(iter(dataloader))[0, :10])