## <div align="center"> Text processing pipeline </div>
Clean and prepare raw text for classification and other NLP tasks.

<hr/>

### Use cases
- Sentiment analysis
- Text summarization
- Machine translation

<img src="https://github.com/sondosaabed/Data-Analyst-Nanodegree/assets/65151701/df752feb-6081-4318-a3a5-125a1d4c68d4" height="600"/>

<hr/>

## <div align ="center">  Pipeline for handling text datasets </div>

<div align ="center"> 
    <img src="https://github.com/sondosaabed/Data-Analyst-Nanodegree/assets/65151701/b7baeb0c-92dc-4cd1-b99a-8dbe9bb7f9d0" height="200">
</div>

<hr/>

## <div align ="center"> Tools </div>

<div align ="center">
    <img src="https://github.com/sondosaabed/Data-Analyst-Nanodegree/assets/65151701/c15b4b04-fc4c-4d09-97e9-1ff0d1697186" height="200"/>
</div>

<hr/>

### Example Dataset: Ham or Spam

In [None]:
import warnings
import pandas as pd
warnings.filterwarnings("ignore")

- Read the data from the source file:

In [None]:
df = pd.read_csv("./email_spam.csv")
df.head()

- Here is a sample email from the dataset:

"Hi Walid,

Do you listen to music on Spotify, YouTube, Amazon or Apple?

If you do - you qualify!

You could be making $50 for every song you stream...

All it takes is 3 steps...

Step 1: Create Your Account
Create your account here

Step 2: Pick Your Favourite Artist
Select from thousands of artists and vibe to the music

Step 3: Get Paid
That's it, for every song you stream...

=> Click here right now to start instantly

Regards,

Alex

---
Connect with us on Telegram: https://t.me/moneymakingcentral"

## <div align ="center"> Preprocessing techniques </div>

<div align ="center">
    <img src="https://github.com/sondosaabed/Data-Analyst-Nanodegree/assets/65151701/47bf60bf-3078-47c1-bb58-2f72b9c9a9f2" height="200">
</div>

- Tokenization
- Stop word removal
- Stemming
- Rare word removal


### Motivation
- Reduce features
- Cleaner, more representative datasets
- **Improved data quality:** removing noise and irrelevant information keeps the data fed into the model clean and consistent.
<hr/>

### 1. **Tokenization**
- Tokenization splits text into tokens: words and punctuation marks
- Here we use torchtext's `basic_english` tokenizer

In [None]:
from torchtext.data.utils import get_tokenizer

tokenizer = get_tokenizer("basic_english")
tokens = tokenizer("I am reading a book now. I love to read books!")

print(tokens)

In [None]:
df['text_tokens'] = df['text'].apply(tokenizer)
df.head()

In [None]:
df['text_tokens'].sample().item()
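
To see what tokenization produces without installing torchtext, the sketch below roughly approximates it with a regular expression. It is a stand-in under stated assumptions, not the `basic_english` implementation: `simple_tokenize` is a hypothetical helper that lowercases the text and separates words from punctuation.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and return words and punctuation marks as separate tokens."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches a single punctuation character
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = simple_tokenize("I am reading a book now. I love to read books!")
print(tokens)
# ['i', 'am', 'reading', 'a', 'book', 'now', '.', 'i', 'love', 'to', 'read', 'books', '!']
```

Note that punctuation marks become tokens of their own, which is why "." and "!" show up in the later examples.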

<hr/>

### **2. Stop word removal**
- Eliminate common words that do not contribute to the meaning
- Stop words: "a", "the", "and", "or", and more

In [None]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
filtered_tokens = [token for token in tokens if token.lower() not in stop_words]

print(filtered_tokens)

In [None]:
def remove_stopwords(tokens):
    return [token for token in tokens if token.lower() not in stop_words]

In [None]:
df['remove_stopwords'] = df['text_tokens'].apply(remove_stopwords)
df.head()

In [None]:
df['remove_stopwords'].sample().item()
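
The filtering logic can also be tried without downloading the NLTK corpus. The sketch below uses a tiny hand-picked stop list, so `SMALL_STOP_WORDS` and `drop_stop_words` are illustrative names and the result differs from what NLTK's much longer English list would give.

```python
# A tiny, hand-picked stop list for illustration only (an assumption, not NLTK's list)
SMALL_STOP_WORDS = {"i", "am", "a", "to", "the", "and", "or"}

def drop_stop_words(tokens, stop_words=SMALL_STOP_WORDS):
    """Keep only tokens whose lowercase form is not in the stop list."""
    return [token for token in tokens if token.lower() not in stop_words]

tokens = ["I", "am", "reading", "a", "book", "now", ".", "I", "love", "to", "read", "books", "!"]
print(drop_stop_words(tokens))
# ['reading', 'book', 'now', '.', 'love', 'read', 'books', '!']
```

The comparison is case-insensitive, and punctuation tokens survive because they are not in the stop list.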

<hr/>

### **3. Stemming**
- Reducing words to their base (stem) form by stripping suffixes
- For example: "running" and "runs" both become "run"; irregular forms such as "ran" are left unchanged by a stemmer

In [None]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
filtered_tokens = ["reading", "book", ".", "love", "read", "books", "!"]
stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]

print(stemmed_tokens)

In [None]:
def stemming(filtered_tokens):
    return [stemmer.stem(token) for token in filtered_tokens]

In [None]:
df['steemed_tokens'] = df['remove_stopwords'].apply(stemming)
df.head()

In [None]:
df['steemed_tokens'].sample().item()
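
To illustrate the idea behind suffix stripping, here is a deliberately crude sketch. `toy_stem` is a hypothetical helper with three hard-coded suffixes; it is not the Porter algorithm, which applies a much larger, ordered set of rules.

```python
def toy_stem(word):
    """Strip a few common English suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters would remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run"
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word

for word in ["running", "runs", "ran", "books", "reading"]:
    print(word, "->", toy_stem(word))
# running -> run
# runs -> run
# ran -> ran
# books -> book
# reading -> read
```

Because the rules only look at suffixes, an irregular form such as "ran" passes through unchanged; mapping it to "run" would need a dictionary-based lemmatizer instead.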

<hr/>

### **4. Rare word removal**
- Removing infrequent words that don't add value

<img src="https://github.com/sondosaabed/Data-Analyst-Nanodegree/assets/65151701/b5c58500-a539-4042-ba89-4b8db816e359" height="500"/>

In [None]:
from nltk.probability import FreqDist

stemmed_tokens = ["read", "book", ".", "love", "read", "book", "!"]
freq_dist = FreqDist(stemmed_tokens)
threshold = 1

common_tokens = [token for token in stemmed_tokens if freq_dist[token] > threshold]
print(common_tokens)

In [None]:
def remove_rare(stemmed_tokens):
    freq_dist = FreqDist(stemmed_tokens)
    return [token for token in stemmed_tokens if freq_dist[token] > 1]

In [None]:
df['rare_words_removed'] = df['steemed_tokens'].apply(remove_rare)
df.head()
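
Counting frequencies inside a single document removes every word that is not repeated in that document. A common alternative is to count across the whole corpus. The sketch below shows that variant with the standard library; `remove_rare_corpus` and `min_count` are hypothetical names, and the three token lists are made-up sample data.

```python
from collections import Counter

def remove_rare_corpus(docs, min_count=2):
    """Drop tokens that occur fewer than `min_count` times across all documents."""
    # Count every token over the whole corpus, not per document
    counts = Counter(token for doc in docs for token in doc)
    return [[token for token in doc if counts[token] >= min_count] for doc in docs]

docs = [
    ["free", "song", "stream", "paid"],
    ["stream", "song", "account"],
    ["meeting", "account", "tomorrow"],
]
print(remove_rare_corpus(docs))
# [['song', 'stream'], ['stream', 'song', 'account'], ['account']]
```

With corpus-level counts, a word that appears once per email but in many emails is kept, while a word seen only once overall is dropped.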

- This is what the final preprocessed text looks like:

In [None]:
processed_text = df['rare_words_removed']
processed_text.sample(10)

<hr/>

## Preprocessing techniques recap
- Tokenization
- Stop word removal
- Stemming
- Rare word removal
- More techniques exist

<hr/>

## <div align ="center"> Encoding techniques </div>

<div align ="center">
    <img src="https://github.com/sondosaabed/Data-Analyst-Nanodegree/assets/65151701/7bdbe523-7e07-442b-87b9-e35e602d49f5" height="120"/>
</div>

### Motivation
- Convert text into machine-readable numbers
- Enable analysis and modeling

<div align ="center">
    <img src="https://github.com/sondosaabed/Data-Analyst-Nanodegree/assets/65151701/ac62b9af-5c9d-4bc5-a643-3df0ee31c394" height="500"/>
</div>

- Allows models to understand and process text
- Choose one technique to avoid redundancy
- More techniques exist

## Encoding techniques
- One-hot encoding: maps each word to its own binary vector
- Bag-of-Words (BoW): captures word frequency, disregarding order
- TF-IDF: weighs how often a word occurs in a document against how common it is across all documents
<hr/>

### **1. One-hot encoding**
- Mapping each word to a distinct vector

Binary vector:
- 1 for the presence of a word
- 0 for the absence of a word

['cat', 'dog', 'rabbit']

'cat' [1, 0, 0]

'dog' [0, 1, 0]

'rabbit' [0, 0, 1]

In [None]:
import torch

vocab = ['cat', 'dog', 'rabbit']
vocab_size = len(vocab)

one_hot_vectors = torch.eye(vocab_size)
one_hot_dict = {word: one_hot_vectors[i] for i, word in enumerate(vocab)}

print(one_hot_dict)

In [None]:
def one_hot_encoding(row):
    # Build a per-row vocabulary; sorting makes the index assignment deterministic
    vocab = sorted(set(row))
    word_to_index = {word: i for i, word in enumerate(vocab)}
    one_hot_vectors = torch.eye(len(vocab))
    # Return one vector per token, in the original token order
    return [one_hot_vectors[word_to_index[word]] for word in row]

In [None]:
df['ohe'] = df['rare_words_removed'].apply(one_hot_encoding)
df['ohe'].head()

In [None]:
df['ohe'][1]
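
When the vocabulary is built per row, the same word can get a different index in each email. A vocabulary shared by the whole corpus avoids that. The sketch below uses plain Python lists instead of tensors; `build_vocab` and `one_hot` are hypothetical helper names, and the two short documents are made up.

```python
def build_vocab(docs):
    """Map each distinct token in the corpus to a fixed integer index."""
    unique_tokens = sorted({token for doc in docs for token in doc})
    return {word: i for i, word in enumerate(unique_tokens)}

def one_hot(token, vocab):
    """Return a list with a single 1 at the token's vocabulary index."""
    vector = [0] * len(vocab)
    vector[vocab[token]] = 1
    return vector

docs = [["cat", "dog"], ["rabbit", "cat"]]
vocab = build_vocab(docs)
print(vocab)
# {'cat': 0, 'dog': 1, 'rabbit': 2}
print([one_hot(token, vocab) for token in docs[1]])
# [[0, 0, 1], [1, 0, 0]]
```

Every vector has the length of the full vocabulary, so "cat" is encoded identically in both documents.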

<hr/>

### **2. Bag of words**
- Example: "The cat sat on the mat"
- Bag-of-words: {'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1}
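
The counts in this example can be reproduced with the standard library. This is a minimal sketch that assumes lowercasing and whitespace splitting are enough for tokenization here:

```python
from collections import Counter

sentence = "The cat sat on the mat"
# Lowercase so that "The" and "the" are counted as the same word
bag = Counter(sentence.lower().split())
print(dict(bag))
# {'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1}

# Word order does not matter: a reshuffled sentence gives the same bag
print(Counter("the mat sat on the cat".split()) == bag)
# True
```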

- Treating each document as an unordered collection of words
- Focuses on frequency, not order

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
corpus = ['This is the first document.', 'This document is the second document.', 'And this is the third one.', 'Is this the first document?']
X = vectorizer.fit_transform(corpus)

print(X.toarray())
print(vectorizer.get_feature_names_out())

<hr/>

### **3. TF-IDF**
Term Frequency-Inverse Document Frequency
- Scores the importance of words in a document
- Rare words have a higher score
- Common ones have a lower score
- Emphasizes informative words
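
To make the scoring concrete, here is a hand-rolled sketch of the textbook formula, tf × log(N / df). The helper `tf_idf` is hypothetical, and the numbers will not match scikit-learn's `TfidfVectorizer`, which by default uses a smoothed IDF and normalizes each row.

```python
import math

corpus = [
    ["this", "is", "the", "first", "document"],
    ["this", "document", "is", "the", "second", "document"],
    ["and", "this", "is", "the", "third", "one"],
]

def tf_idf(term, doc, corpus):
    """Textbook TF-IDF: term frequency in `doc` times log(N / document frequency)."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    # A term that appears in no document gets a score of 0
    return tf * math.log(len(corpus) / df) if df else 0.0

# "the" appears in every document, so its score is 0
print(tf_idf("the", corpus[1], corpus))
# "second" appears in only one document, so it scores higher (about 0.183)
print(tf_idf("second", corpus[1], corpus))
```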

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
corpus = ['This is the first document.','This document is the second document.', 'And this is the third one.','Is this the first document?']

X = vectorizer.fit_transform(corpus)
print(X.toarray())
print(vectorizer.get_feature_names_out())

<hr/>

## Encoding techniques recap
- One-hot encoding
- Bag of words
- TF-IDF encoding
- More techniques exist

In [None]:
# Import libraries
from torch.utils.data import Dataset, DataLoader

# Create a class
class TextDataset(Dataset):
    def __init__(self, text):
        self.text = text
    def __len__(self):
        return len(self.text)
    def __getitem__(self, idx):
        return self.text[idx]

## Full text preparation pipeline

In [None]:
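def preprocess_sentences(sentences, threshold=0):
    processed_sentences = []

    for sentence in sentences:
        # Lowercase, tokenize, drop stop words, then stem
        sentence = sentence.lower()
        tokens = tokenizer(sentence)
        tokens = [token for token in tokens if token not in stop_words]
        tokens = [stemmer.stem(token) for token in tokens]
        # Keep tokens that occur more than `threshold` times in the sentence.
        # The default of 0 keeps every token: words rarely repeat within a single
        # sentence, so a higher threshold would leave most sentences empty.
        freq_dist = FreqDist(tokens)
        tokens = [token for token in tokens if freq_dist[token] > threshold]
        processed_sentences.append(' '.join(tokens))

    return processed_sentences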

In [None]:
def encode_sentences(sentences):
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(sentences)
    encoded_sentences = X.toarray()
    return encoded_sentences, vectorizer

In [None]:
import re
def extract_sentences(data):
    sentences = re.findall(r'[A-Z][^.!?]*[.!?]', data)
    return sentences

In [None]:
def text_processing_pipeline(text):
    tokens = preprocess_sentences(text)
    encoded_sentences, vectorizer = encode_sentences(tokens)
    dataset = TextDataset(encoded_sentences)
    dataloader = DataLoader(dataset, batch_size=2, shuffle=True)
    return dataloader, vectorizer

In [None]:
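text_data = "This is the first text data. And here is another one."
sentences = extract_sentences(text_data)
# Pass the whole list of sentences through the pipeline in one call
dataloader, vectorizer = text_processing_pipeline(sentences)
# Show up to the first 10 features of the first sentence in the batch
print(next(iter(dataloader))[0, :10])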