## <div align="center"> Text processing pipeline </div>
Clean and prepare text data for classification and other NLP tasks.

<hr/>

## <div align ="center">  Pipeline of handling text data sets </div>

<div align ="center"> 
    <img src="https://github.com/sondosaabed/Data-Analyst-Nanodegree/assets/65151701/b7baeb0c-92dc-4cd1-b99a-8dbe9bb7f9d0" height="200">
</div>

<hr/>

### Use cases
- Sentiment analysis
- Text summarization
- Machine translation

<hr/>

## <div align ="center"> Tools </div>

<div align ="center">
    <img src="https://github.com/sondosaabed/Data-Analyst-Nanodegree/assets/65151701/c15b4b04-fc4c-4d09-97e9-1ff0d1697186" height="200"/>
</div>

<hr/>

### Example Dataset: Ham or Spam

In [1]:
import warnings
import pandas as pd
warnings.filterwarnings("ignore")

- Reading the data from the source file: 

In [2]:
df = pd.read_csv("./email_spam.csv")
df.head()

Unnamed: 0,title,text,type
0,?? the secrets to SUCCESS,"Hi James,\n\nHave you claim your complimentary...",spam
1,?? You Earned 500 GCLoot Points,"\nalt_text\nCongratulations, you just earned\n...",not spam
2,?? Your GitHub launch code,"Here's your GitHub launch code, @Mortyj420!\n ...",not spam
3,[The Virtual Reward Center] Re: ** Clarifications,"Hello,\n \nThank you for contacting the Virtua...",not spam
4,"10-1 MLB Expert Inside, Plus Everything You Ne...","Hey Prachanda Rawal,\n\nToday's newsletter is ...",spam


- Here is a sample email from the dataset:

"Hi Walid,

Do you listen to music on Spotify, YouTube, Amazon or Apple?

If you do - you qualify!

You could be making $50 for every song you stream...

All it takes is 3 steps...

Step 1: Create Your Account
Create your account here

Step 2: Pick Your Favourite Artist
Select from thousands of artists and vibe to the music

Step 3: Get Paid
That's it, for every song you stream...

=> Click here right now to start instantly

Regards,

Alex

---
?? Connect with us on Telegram: https://t.me/moneymakingcentral"

## <div align ="center"> Preprocessing techniques </div>

<div align ="center">
    <img src="https://github.com/sondosaabed/Data-Analyst-Nanodegree/assets/65151701/47bf60bf-3078-47c1-bb58-2f72b9c9a9f2" height="200">
</div>

- Tokenization
- Stop word removal
- Stemming
- Rare word removal


### Motivation
- Reduce features
- Cleaner, more representative datasets
<hr/>

### 1. **Tokenization**
- Splits text into tokens: individual words and punctuation marks
- The example below uses the `basic_english` tokenizer from torchtext

In [3]:
from torchtext.data.utils import get_tokenizer

tokenizer = get_tokenizer("basic_english")
tokens = tokenizer("I am reading a book now. I love to read books!")

print(tokens)

['i', 'am', 'reading', 'a', 'book', 'now', '.', 'i', 'love', 'to', 'read', 'books', '!']


In [4]:
df['text_tokens'] = df['text'].apply(tokenizer)
df.head()

Unnamed: 0,title,text,type,text_tokens
0,?? the secrets to SUCCESS,"Hi James,\n\nHave you claim your complimentary...",spam,"[hi, james, ,, have, you, claim, your, complim..."
1,?? You Earned 500 GCLoot Points,"\nalt_text\nCongratulations, you just earned\n...",not spam,"[alt_text, congratulations, ,, you, just, earn..."
2,?? Your GitHub launch code,"Here's your GitHub launch code, @Mortyj420!\n ...",not spam,"[here, ', s, your, github, launch, code, ,, @m..."
3,[The Virtual Reward Center] Re: ** Clarifications,"Hello,\n \nThank you for contacting the Virtua...",not spam,"[hello, ,, thank, you, for, contacting, the, v..."
4,"10-1 MLB Expert Inside, Plus Everything You Ne...","Hey Prachanda Rawal,\n\nToday's newsletter is ...",spam,"[hey, prachanda, rawal, ,, today, ', s, newsle..."
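
For intuition, the behaviour of the tokenizer above can be approximated with a single regular expression. This is a minimal sketch, not torchtext's implementation: `simple_tokenize` is a hypothetical helper, and its output is only expected to agree with `basic_english` on simple sentences like the one used earlier.

```python
import re

def simple_tokenize(text):
    """Lowercase the text, then split it into word tokens and single punctuation marks."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = simple_tokenize("I am reading a book now. I love to read books!")
print(tokens)
```

Because punctuation marks become tokens of their own, they survive into the later steps unless they are filtered out explicitly.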


<hr/>

### **2. Stop word removal**
- Eliminate common words that contribute little to the meaning
- Stop words: "a", "the", "and", "or", and more

In [5]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
filtered_tokens = [token for token in tokens if token.lower() not in stop_words]

print(filtered_tokens)

['reading', 'book', '.', 'love', 'read', 'books', '!']


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\SS\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [6]:
def remove_stopwords(tokens):
    return [token for token in tokens if token.lower() not in stop_words]

In [7]:
df['remove_stopwords'] = df['text_tokens'].apply(remove_stopwords)
df.head()

Unnamed: 0,title,text,type,text_tokens,remove_stopwords
0,?? the secrets to SUCCESS,"Hi James,\n\nHave you claim your complimentary...",spam,"[hi, james, ,, have, you, claim, your, complim...","[hi, james, ,, claim, complimentary, gift, yet..."
1,?? You Earned 500 GCLoot Points,"\nalt_text\nCongratulations, you just earned\n...",not spam,"[alt_text, congratulations, ,, you, just, earn...","[alt_text, congratulations, ,, earned, 500, co..."
2,?? Your GitHub launch code,"Here's your GitHub launch code, @Mortyj420!\n ...",not spam,"[here, ', s, your, github, launch, code, ,, @m...","[', github, launch, code, ,, @mortyj420, !, oc..."
3,[The Virtual Reward Center] Re: ** Clarifications,"Hello,\n \nThank you for contacting the Virtua...",not spam,"[hello, ,, thank, you, for, contacting, the, v...","[hello, ,, thank, contacting, virtual, reward,..."
4,"10-1 MLB Expert Inside, Plus Everything You Ne...","Hey Prachanda Rawal,\n\nToday's newsletter is ...",spam,"[hey, prachanda, rawal, ,, today, ', s, newsle...","[hey, prachanda, rawal, ,, today, ', newslette..."
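
The `remove_stopwords` column above still contains punctuation tokens such as `,` and `'`, because only words on the stop word list are filtered. One option, sketched below, is to drop punctuation-only tokens in the same pass. `SMALL_STOP_WORDS` and `remove_stopwords_and_punctuation` are hypothetical names; the tiny hand-picked set only stands in for the NLTK list so that the sketch runs without a download.

```python
import string

# Hypothetical stand-in for the NLTK stop word set, kept tiny so the sketch runs offline
SMALL_STOP_WORDS = {"i", "am", "a", "now", "to", "the", "and", "or"}

def remove_stopwords_and_punctuation(tokens, stop_words=SMALL_STOP_WORDS):
    """Keep tokens that are neither stop words nor made up only of ASCII punctuation."""
    kept = []
    for token in tokens:
        if token.lower() in stop_words:
            continue  # common word, carries little meaning
        if all(char in string.punctuation for char in token):
            continue  # token such as "." or "!!"
        kept.append(token)
    return kept

tokens = ['i', 'am', 'reading', 'a', 'book', 'now', '.', 'i', 'love', 'to', 'read', 'books', '!']
filtered = remove_stopwords_and_punctuation(tokens)
print(filtered)
```

Note that `string.punctuation` covers ASCII characters only, so symbols such as `•` or `’` would need to be handled separately.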


<hr/>

### **3. Stemming**
- Reducing words to their stem by stripping suffixes
- For example: "running" and "runs" both become "run"; irregular forms such as "ran" are left unchanged by the Porter stemmer

In [8]:
import nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
filtered_tokens = ["reading", "book", ".", "love", "read", "books", "!"]
stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]

print(stemmed_tokens)

['read', 'book', '.', 'love', 'read', 'book', '!']


In [9]:
def stemming(filtered_tokens):
    return [stemmer.stem(token) for token in filtered_tokens]

In [10]:
df['steemed_tokens'] = df['remove_stopwords'].apply(stemming)
df.head()

Unnamed: 0,title,text,type,text_tokens,remove_stopwords,steemed_tokens
0,?? the secrets to SUCCESS,"Hi James,\n\nHave you claim your complimentary...",spam,"[hi, james, ,, have, you, claim, your, complim...","[hi, james, ,, claim, complimentary, gift, yet...","[hi, jame, ,, claim, complimentari, gift, yet,..."
1,?? You Earned 500 GCLoot Points,"\nalt_text\nCongratulations, you just earned\n...",not spam,"[alt_text, congratulations, ,, you, just, earn...","[alt_text, congratulations, ,, earned, 500, co...","[alt_text, congratul, ,, earn, 500, complet, f..."
2,?? Your GitHub launch code,"Here's your GitHub launch code, @Mortyj420!\n ...",not spam,"[here, ', s, your, github, launch, code, ,, @m...","[', github, launch, code, ,, @mortyj420, !, oc...","[', github, launch, code, ,, @mortyj420, !, oc..."
3,[The Virtual Reward Center] Re: ** Clarifications,"Hello,\n \nThank you for contacting the Virtua...",not spam,"[hello, ,, thank, you, for, contacting, the, v...","[hello, ,, thank, contacting, virtual, reward,...","[hello, ,, thank, contact, virtual, reward, ce..."
4,"10-1 MLB Expert Inside, Plus Everything You Ne...","Hey Prachanda Rawal,\n\nToday's newsletter is ...",spam,"[hey, prachanda, rawal, ,, today, ', s, newsle...","[hey, prachanda, rawal, ,, today, ', newslette...","[hey, prachanda, rawal, ,, today, ', newslett,..."
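
Stems such as `complimentari` and `jame` in the table above are not dictionary words, because a stemmer strips suffixes by rule rather than looking words up. The sketch below illustrates that idea with a deliberately naive rule set. `naive_stem` is a hypothetical helper and is far simpler than the Porter algorithm, so its results will differ from `PorterStemmer` on many words.

```python
def naive_stem(token, suffixes=("ing", "ed", "s")):
    """Strip the first matching suffix, keeping at least three characters of the word."""
    for suffix in suffixes:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token  # no rule applied, so the token is returned unchanged

words = ["reading", "books", "runs", "ran", "love"]
stems = [naive_stem(word) for word in words]
print(stems)
```

The irregular form "ran" passes through untouched, since no suffix rule applies to it.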


<hr/>

### **4. Rare word removal**
- Removing infrequent words that don't add value

In [11]:
from nltk.probability import FreqDist

stemmed_tokens = ["read", "book", ".", "love", "read", "book", "!"]
freq_dist = FreqDist(stemmed_tokens)
threshold = 1

common_tokens = [token for token in stemmed_tokens if freq_dist[token] > threshold]
print(common_tokens)

['read', 'book', 'read', 'book']


In [12]:
def remove_rare(stemmed_tokens):
    freq_dist = FreqDist(stemmed_tokens)
    return [token for token in stemmed_tokens if freq_dist[token] > 1]

In [13]:
df['rare_words_removed'] = df['steemed_tokens'].apply(remove_rare)
df.tail()

Unnamed: 0,title,text,type,text_tokens,remove_stopwords,steemed_tokens,rare_words_removed
79,Your application for the position of Child Pr...,"Dear Maryam, \n\n \n\nI would like to thank yo...",not spam,"[dear, maryam, ,, i, would, like, to, thank, y...","[dear, maryam, ,, would, like, thank, applicat...","[dear, maryam, ,, would, like, thank, applic, ...","[,, applic, ,, ., applic, ., ,, .]"
80,Your Kilimall Account is Ready - Shopping Now!,"Dear Customer,\n\nWelcome to Kilimall, Thanks ...",not spam,"[dear, customer, ,, welcome, to, kilimall, ,, ...","[dear, customer, ,, welcome, kilimall, ,, than...","[dear, custom, ,, welcom, kilimal, ,, thank, m...","[custom, ,, kilimal, ,, much, ., kilimal, afri..."
81,Your Steam account: Access from new web or mob...,"Dear vladis163rus,\nHere is the Steam Guard co...",not spam,"[dear, vladis163rus, ,, here, is, the, steam, ...","[dear, vladis163rus, ,, steam, guard, code, ne...","[dear, vladis163ru, ,, steam, guard, code, nee...","[vladis163ru, steam, guard, code, login, accou..."
82,Your uploaded document is rejected,View In Browser | Log in\n \n \n\nSkrill logo\...,not spam,"[view, in, browser, |, log, in, skrill, logo, ...","[view, browser, |, log, skrill, logo, money, m...","[view, browser, |, log, skrill, logo, money, m...","[|, skrill, money, ?, ?, couldn’t, verifi, add..."
83,You've Earned a Reward from Bard Explorers India,You've received a gift!\nSign in to your Bard ...,not spam,"[you, ', ve, received, a, gift, !, sign, in, t...","[', received, gift, !, sign, bard, explorers, ...","[', receiv, gift, !, sign, bard, explor, india...","[gift, !, bard, explor, india, commun, member,..."


- This is what the final preprocessed text looks like:

In [14]:
processed_text = df['rare_words_removed']
processed_text.sample(20)

46                                                   []
59    [netflix, ,, ,, find, inform, request, netflix...
24    [respond, feedback, feedback, us, ., ', ,, us,...
66                       [,, ,, ,, ., ,, ., best, best]
29    [,, ,, find, supplement, program, ., supplemen...
62    [,, paypal, $8, ,, 32, usd, ., transact, trans...
30            [singl, look, ?, ?, ', ', look, ', singl]
75    [!, applic, process, job, -, need, (, ), (, jo...
73    [?, ?, six, pack, ?, ,, engin, •, answer, ,, !...
33    [top, stori, quora, ?, code, c++, hard, ,, man...
44    [notic, login, ,, alexxuzi, notic, login, devi...
40    [live, ?, ?, live, get, ., get, hifi, hifi, ,,...
54    [snapchat+, ,, ', ., snapchat+, plan, set, ren...
56                               [microsoft, microsoft]
78    [amazon, order, confirm, maliek, ,, shop, ., c...
41    [,, zoom, call, -, ,, (, ), ., regist, free, s...
53    [sale, execut, sale, execut, posit, ., review,...
2                  [github, code, github, code, 
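
The `remove_rare` function above counts frequencies inside each email separately, so a word that appears once in an email is dropped even if it is common across the dataset. A possible alternative, sketched here under that reading, counts tokens over the whole corpus first. `remove_rare_corpus_wide` and `min_count` are hypothetical names, and the three short documents are made-up illustration data.

```python
from collections import Counter

def remove_rare_corpus_wide(documents, min_count=2):
    """Drop tokens that occur fewer than min_count times across all documents."""
    # Count every token over the whole corpus, not per document
    corpus_counts = Counter(token for doc in documents for token in doc)
    return [
        [token for token in doc if corpus_counts[token] >= min_count]
        for doc in documents
    ]

documents = [
    ["read", "book", "love"],
    ["love", "book", "shop"],
    ["read", "code"],
]
filtered = remove_rare_corpus_wide(documents)
print(filtered)
```

With the DataFrame from this notebook, the input could be `df['steemed_tokens'].tolist()`.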

<hr/>

## Preprocessing techniques recap
- Tokenization
- Stop word removal
- Stemming
- Rare word removal
- More techniques exist

<hr/>

## <div align ="center"> Encoding techniques </div>

<div align ="center">
    <img src="https://github.com/sondosaabed/Data-Analyst-Nanodegree/assets/65151701/7bdbe523-7e07-442b-87b9-e35e602d49f5" height="120"/>
</div>

### Motivation
- Convert text into machine-readable numbers
- Enable analysis and modeling

<div align ="center">
    <img src="https://github.com/sondosaabed/Data-Analyst-Nanodegree/assets/65151701/ac62b9af-5c9d-4bc5-a643-3df0ee31c394" height="500"/>
</div>


- Allows models to understand and process text
- Choose one technique to avoid redundancy
- More techniques exist

## Encoding Techniques
- One-hot encoding: represents each word as a binary vector with a single 1
- Bag-of-Words (BoW): captures word frequency, disregarding order
- TF-IDF: weights a word's frequency in a document against how common the word is across all documents
<hr/>

### **1. One-hot encoding**
- Maps each word in the vocabulary to a distinct binary vector

Binary vector:
- 1 at the position assigned to the word
- 0 at every other position

['cat', 'dog', 'rabbit']

'cat' [1, 0, 0]

'dog' [0, 1, 0]

'rabbit' [0, 0, 1]

In [15]:
import torch

vocab = ['cat', 'dog', 'rabbit']
vocab_size = len(vocab)

one_hot_vectors = torch.eye(vocab_size)
one_hot_dict = {word: one_hot_vectors[i] for i, word in enumerate(vocab)}

print(one_hot_dict)

{'cat': tensor([1., 0., 0.]), 'dog': tensor([0., 1., 0.]), 'rabbit': tensor([0., 0., 1.])}


In [40]:
def one_hot_encoding(row):
    # Build the vocabulary from this row only; sort it so the word order is reproducible
    vocab = sorted(set(row))
    # Row i of the identity matrix is the one-hot vector for vocab[i]
    one_hot_vectors = torch.eye(len(vocab))
    return [one_hot_vectors[i] for i in range(len(vocab))]

In [42]:
df['ohe'] = df['rare_words_removed'].apply(one_hot_encoding)
df['ohe'].head()

0    [[tensor(1.), tensor(0.), tensor(0.), tensor(0...
1    [[tensor(1.), tensor(0.), tensor(0.), tensor(0...
2    [[tensor(1.), tensor(0.)], [tensor(0.), tensor...
3    [[tensor(1.), tensor(0.), tensor(0.), tensor(0...
4    [[tensor(1.), tensor(0.), tensor(0.), tensor(0...
Name: ohe, dtype: object

In [45]:
df['ohe'][1]

[tensor([1., 0., 0., 0., 0.]),
 tensor([0., 1., 0., 0., 0.]),
 tensor([0., 0., 1., 0., 0.]),
 tensor([0., 0., 0., 1., 0.]),
 tensor([0., 0., 0., 0., 1.])]
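
The function above builds a separate vocabulary for every row, so the same word can receive different vectors (and different vector lengths) in different emails. A minimal sketch of one alternative is to build a single shared vocabulary first and encode every token against it. It uses plain Python lists instead of tensors to stay dependency-free; `build_vocab` and `one_hot_encode` are hypothetical names, and the two small documents are made-up illustration data.

```python
def build_vocab(documents):
    """Map every distinct token in the corpus to a fixed index (sorted for reproducibility)."""
    distinct_tokens = sorted({token for doc in documents for token in doc})
    return {token: index for index, token in enumerate(distinct_tokens)}

def one_hot_encode(tokens, vocab):
    """Return one vector per token, with a single 1 at that token's vocabulary index."""
    vectors = []
    for token in tokens:
        vector = [0] * len(vocab)
        vector[vocab[token]] = 1
        vectors.append(vector)
    return vectors

documents = [["cat", "dog"], ["rabbit", "cat"]]
vocab = build_vocab(documents)
encoded = one_hot_encode(["rabbit", "cat"], vocab)
print(vocab)
print(encoded)
```

With a shared vocabulary, "cat" maps to the same vector in every document, which is what makes the encodings comparable across rows.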

<hr/>

### **2. Bag of words**
- Example: "The cat sat on the mat"
- Bag-of-words: {'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1}

- Treating each document as an unordered collection of words
- Focuses on frequency, not order

In [31]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
corpus = ['This is the first document.', 'This document is the second document.', 'And this is the third one.', 'Is this the first document?']
X = vectorizer.fit_transform(corpus)

print(X.toarray())
print(vectorizer.get_feature_names_out())

[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
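
The "The cat sat on the mat" example from the start of this section can be reproduced with the standard library alone. This is a minimal sketch rather than a replacement for `CountVectorizer`: `bag_of_words` is a hypothetical helper that splits on whitespace only, and the six-word `vocab` is chosen here purely for illustration.

```python
from collections import Counter

def bag_of_words(text):
    """Count how often each lowercased, whitespace-separated word occurs."""
    return Counter(text.lower().split())

bow = bag_of_words("The cat sat on the mat")
print(dict(bow))

# Turn the counts into a fixed-length vector over a chosen vocabulary;
# words missing from the text (here "dog") get a count of 0
vocab = ["cat", "dog", "mat", "on", "sat", "the"]
vector = [bow[word] for word in vocab]
print(vector)
```

Each row printed by `X.toarray()` above is this kind of count vector, one position per vocabulary word.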


<hr/>

### **3. TF-IDF**
Term Frequency-Inverse Document Frequency
- Scores the importance of words in a document
- Words that are rare across the corpus get a higher score
- Words that appear in most documents get a lower score
- Emphasizes informative words

In [32]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
corpus = ['This is the first document.','This document is the second document.', 'And this is the third one.','Is this the first document?']

X = vectorizer.fit_transform(corpus)
print(X.toarray())
print(vectorizer.get_feature_names_out())

[[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]
 [0.         0.6876236  0.         0.28108867 0.         0.53864762
  0.28108867 0.         0.28108867]
 [0.51184851 0.         0.         0.26710379 0.51184851 0.
  0.26710379 0.51184851 0.26710379]
 [0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]]
['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
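
To see where these numbers come from, the first row can be recomputed by hand. The sketch below assumes scikit-learn's default settings for `TfidfVectorizer`: raw term counts, smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, L2 normalisation of each row, and tokens of two or more word characters. The helper name `tfidf_vector` is hypothetical.

```python
import math
import re
from collections import Counter

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# Lowercase each document and keep tokens of two or more word characters
docs = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in corpus]
vocab = sorted({token for doc in docs for token in doc})
n_docs = len(docs)

# Document frequency: in how many documents does each token appear?
doc_freq = Counter(token for doc in docs for token in set(doc))
# Smoothed inverse document frequency
idf = {token: math.log((1 + n_docs) / (1 + doc_freq[token])) + 1 for token in vocab}

def tfidf_vector(doc):
    """Weight term counts by idf, then scale the vector to unit L2 length."""
    counts = Counter(doc)
    weights = [counts[token] * idf[token] for token in vocab]
    norm = math.sqrt(sum(weight * weight for weight in weights))
    return [weight / norm for weight in weights]

first_row = tfidf_vector(docs[0])
print([round(weight, 4) for weight in first_row])
```

Under these assumptions the result should agree with the first row of `X.toarray()` above: "first" appears in two of the four documents and scores higher than "is", "the", and "this", which appear in all of them.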


<hr/>

## Encoding techniques recap
- One-hot encoding
- Bag of words
- TF-IDF encoding
- More techniques exist