![image](https://github.com/sondosaabed/SP.TOP-Data-Science-and-Analytics/assets/65151701/c4d8ed09-3b36-4829-b5dc-eb74f17b424c)

### Objectives
- Dataset as a container for the processed and encoded text
- DataLoader: batching, shuffling, and multiprocessing

In [22]:
from torch.utils.data import Dataset, DataLoader
from torchtext.data.utils import get_tokenizer
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.probability import FreqDist
from sklearn.feature_extraction.text import CountVectorizer
import re

In [23]:
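class TextDataset(Dataset):
    """Container that serves one text item (or encoded sentence) per index."""
    def __init__(self, text):
        self.text = text
    def __len__(self):
        # Number of items available to the DataLoader
        return len(self.text)
    def __getitem__(self, index):
        # Return a single item; the DataLoader groups these into batches
        return self.text[index]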
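
As a rough illustration of the container idea, the sketch below wraps a short list of strings in the same kind of Dataset and reads it back through a DataLoader. The sample strings and the seed are made up for the example. `num_workers=0` keeps loading in the main process; raising it is how the multiprocessing part of the objectives would be switched on.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class TextDataset(Dataset):
    """Minimal container: stores a sequence and serves items by index."""

    def __init__(self, text):
        self.text = text

    def __len__(self):
        return len(self.text)

    def __getitem__(self, index):
        return self.text[index]


# Made-up sample data for illustration only
samples = ["first text", "second text", "third text", "fourth text", "fifth text"]
dataset = TextDataset(samples)

# Without shuffling, batches follow the original order; the last batch may be smaller
ordered = list(DataLoader(dataset, batch_size=2, shuffle=False, num_workers=0))

# With shuffling, a seeded generator makes the random order repeatable
generator = torch.Generator().manual_seed(0)
shuffled = list(DataLoader(dataset, batch_size=2, shuffle=True, generator=generator))

print(len(dataset), len(ordered))
print(ordered[0])
```

Strings are batched as plain Python lists, while numeric arrays are stacked into tensors, which is why the pipeline encodes the text before handing it to the DataLoader.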

In [25]:
tokenizer = get_tokenizer("basic_english")
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\SS\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [15]:
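def preprocess_sentences(sentences, min_freq=1):
    """
    Lowercase, tokenize, remove stop words from, and stem each sentence.

    Tokens that occur fewer than min_freq times within their own sentence
    are dropped. The default of 1 keeps every token. A high cutoff empties
    short sentences, which leaves CountVectorizer with an empty vocabulary.
    """
    processed_sentences = []
    for sentence in sentences:
        sentence = sentence.lower()
        tokens = tokenizer(sentence)
        tokens = [token for token in tokens if token not in stop_words]
        tokens = [stemmer.stem(token) for token in tokens]
        # Count each stem within the sentence and keep the frequent ones
        freq_dist = FreqDist(tokens)
        tokens = [token for token in tokens if freq_dist[token] >= min_freq]
        processed_sentences.append(' '.join(tokens))
    return processed_sentences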
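
The frequency filter in this function counts tokens inside one sentence at a time, so a token has to repeat within its own sentence to pass a cutoff above 1. A common alternative is to count over the whole corpus. The sketch below uses only the standard library; the helper name `filter_rare_tokens`, the `min_count` parameter, and the sample tokens are illustrative assumptions, not part of the notebook.

```python
from collections import Counter


def filter_rare_tokens(tokenized_sentences, min_count=2):
    """Keep tokens that occur at least `min_count` times across all sentences."""
    corpus_counts = Counter(
        token for tokens in tokenized_sentences for token in tokens
    )
    return [
        [token for token in tokens if corpus_counts[token] >= min_count]
        for tokens in tokenized_sentences
    ]


# Already tokenized, stop-word-free, stemmed sentences (made-up sample)
tokenized = [["first", "text", "data"], ["anoth", "one"], ["anoth", "one"]]

filtered = filter_rare_tokens(tokenized, min_count=2)
print(filtered)
```

In this sample the first sentence ends up empty because each of its tokens occurs only once in the corpus, so downstream code still has to tolerate empty documents.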

In [18]:
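def encode_sentences(sentences):
    """
    Encode sentences as bag-of-words count vectors.

    Returns the dense count matrix (one row per sentence) and the fitted
    CountVectorizer, whose vocabulary maps columns back to tokens.
    """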
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(sentences)
    encoded_sentences = X.toarray()
    return encoded_sentences, vectorizer

In [19]:
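def extract_sentences(data):
    """
    Split raw text into sentences with a regular expression.

    A sentence is taken to start at an uppercase letter and end at the
    first '.', '!' or '?' that follows.
    """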
    sentences = re.findall(r'[A-Z][^.!?]*[.!?]', data)
    return sentences
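
The pattern treats a sentence as an uppercase letter followed by everything up to the next `.`, `!`, or `?`. The sketch below shows what that returns on simple text and two cases where it falls short; the sample strings are made up for illustration.

```python
import re

SENTENCE_PATTERN = r'[A-Z][^.!?]*[.!?]'

text = "This is the first text data.  And here is another one. Is it? Yes!"
simple = re.findall(SENTENCE_PATTERN, text)
print(simple)

# Limitation 1: a period inside an abbreviation ends the match early
abbreviation = re.findall(SENTENCE_PATTERN, "Dr. Smith went home.")
print(abbreviation)

# Limitation 2: text before the first uppercase letter is skipped
lowercase = re.findall(SENTENCE_PATTERN, "no capital here. But this one is found.")
print(lowercase)
```

For clean, well-punctuated text the pattern is enough; for real prose a dedicated sentence tokenizer is usually a safer choice.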

#### The whole text processing pipeline

In [20]:
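def text_processing_pipeline(sentences):
    """
    Preprocess and encode a list of sentences, then wrap them in a DataLoader.

    Returns the DataLoader (shuffled batches of two bag-of-words vectors)
    and the fitted CountVectorizer.
    """
    processed_sentences = preprocess_sentences(sentences)
    encoded_sentences, vectorizer = encode_sentences(processed_sentences)
    dataset = TextDataset(encoded_sentences)
    dataloader = DataLoader(dataset, batch_size=2, shuffle=True)
    return dataloader, vectorizer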

In [27]:
text_data = "This is the first text data.  And here is another one. And here is another one."

# Split the raw text into sentences, then pass the whole list through the pipeline once
sentences = extract_sentences(text_data)
dataloader, vectorizer = text_processing_pipeline(sentences)

# First ten components of the first encoded sentence in the first batch
print(next(iter(dataloader))[0, :10])

### Practice

Shakespearean language preprocessing pipeline

Over at PyBooks, the team wants to transform a vast library of Shakespearean text data for further analysis. The most efficient way to do this is with a text processing pipeline, starting with the preprocessing steps.

The following have been loaded for you: torch, nltk, stopwords, PorterStemmer, get_tokenizer.

The Shakespearean text data is saved as shakespeare and the sentences have already been extracted.

- Create a set of unique English stopwords and save it to stop_words.
- Initialize the basic_english tokenizer from torchtext and the PorterStemmer from nltk.
- Complete the preprocess_sentences() function to perform tokenization, stop word removal, and stemming.


In [34]:
stop_words = set(stopwords.words('english'))

# Initialize the tokenizer and stemmer
tokenizer = get_tokenizer("basic_english")
stemmer = PorterStemmer() 

# Complete the function to preprocess sentences
def preprocess_sentences(sentences):
    processed_sentences = []
    for sentence in sentences:
        sentence = sentence.lower()
        tokens = tokenizer(sentence)
        tokens = [token for token in tokens if token not in stop_words]
        tokens = [stemmer.stem(token) for token in tokens]
        processed_sentences.append(' '.join(tokens))
    return processed_sentences

# preprocess_sentences() expects a list of sentences; a bare string would be
# iterated character by character
processed_shakespeare = preprocess_sentences([
    "Complete the function to preprocess sentences.",
    "Initialize the tokenizer and stemmer",
])
print(processed_shakespeare[:5])


Shakespearean language encoder

With the preprocessed Shakespearean text at your fingertips, you now need to encode it into a numerical representation. You will need to define the encoding steps before putting the pipeline together. To better handle large amounts of data and efficiently perform the encoding, you will use PyTorch's Dataset and DataLoader for batching and shuffling the data.

The following have been loaded for you: torch, nltk, stopwords, PorterStemmer, get_tokenizer, CountVectorizer, Dataset, and DataLoader.

The processed_shakespeare from the Shakespearean text is also available to you.

- Define a ShakespeareDataset class and complete the `__init__` and `__getitem__` methods.
- Complete the encode_sentences() function to take in a list of sentences and encode them using the bag-of-words technique from sklearn.
- Complete and call the text_processing_pipeline() function by using preprocess_sentences(), encode_sentences(), the ShakespeareDataset class, and DataLoader.
- Print the first ten feature names with the get_feature_names_out() method, then the first ten components of the first item of dataloader.


In [36]:
# Define your Dataset class
class ShakespeareDataset(Dataset):
    def __init__(self, data):
        self.data = data
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        return self.data[idx]

# Complete the encoding function
def encode_sentences(sentences):
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(sentences)
    return X.toarray(), vectorizer

# Complete the text processing pipeline
def text_processing_pipeline(sentences):
    processed_sentences = preprocess_sentences(sentences)
    encoded_sentences, vectorizer = encode_sentences(processed_sentences)
    dataset = ShakespeareDataset(encoded_sentences)
    dataloader = DataLoader(dataset, batch_size=2, shuffle=True)
    return dataloader, vectorizer

dataloader, vectorizer = text_processing_pipeline(processed_shakespeare)

# Print the vectorizer's feature names and the first 10 components of the first item
print(vectorizer.get_feature_names_out()[:10])
print(next(iter(dataloader))[0, :10])
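
For reference, a batch drawn from a DataLoader over a bag-of-words matrix is a tensor with one row per sentence and one column per vocabulary entry. The sketch below uses a small hand-written count matrix and feature list, both assumptions for illustration, in place of the Shakespeare data.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader

# Hand-written stand-ins for vectorizer.get_feature_names_out() and X.toarray()
feature_names = ["anoth", "data", "first", "one", "text"]
encoded = np.array([
    [0, 1, 1, 0, 1],
    [1, 0, 0, 1, 0],
    [1, 0, 0, 1, 0],
])

# A NumPy array supports len() and integer indexing, so it can act as the dataset
dataloader = DataLoader(encoded, batch_size=2, shuffle=False)
batch = next(iter(dataloader))

# Rows are the sentences in the batch, columns are vocabulary entries
print(batch.shape)

# Map the non-zero columns of the first row back to their tokens
first_row = batch[0]
tokens = [feature_names[i] for i in torch.nonzero(first_row).flatten().tolist()]
print(tokens)
```

Indexing a batch with `[0, :10]`, as the exercise does, selects the first ten of those columns for the first sentence in the batch.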