## <div align="center"> Text processing pipeline </div>
Clean and prepare text for classification and other NLP tasks.

<hr/>

### Use cases
- Sentiment analysis
- Text summarization
- Machine translation

<img src="https://github.com/sondosaabed/Data-Analyst-Nanodegree/assets/65151701/df752feb-6081-4318-a3a5-125a1d4c68d4" height="600"/>

<hr/>

## <div align ="center">  Pipeline for handling text datasets </div>

<div align ="center"> 
    <img src="https://github.com/sondosaabed/Data-Analyst-Nanodegree/assets/65151701/b7baeb0c-92dc-4cd1-b99a-8dbe9bb7f9d0" height="200">
</div>

<hr/>

## <div align ="center"> Tools </div>

<div align ="center">
    <img src="https://github.com/sondosaabed/Data-Analyst-Nanodegree/assets/65151701/c15b4b04-fc4c-4d09-97e9-1ff0d1697186" height="200"/>
</div>

<hr/>

### Example Dataset: Ham or Spam

In [23]:
import warnings
import pandas as pd
warnings.filterwarnings("ignore")

- Reading the data from the source file: 

In [25]:
df = pd.read_csv("./email_spam.csv")
df.head()

Unnamed: 0,title,text,type
0,?? the secrets to SUCCESS,"Hi James,\n\nHave you claim your complimentary...",spam
1,?? You Earned 500 GCLoot Points,"\nalt_text\nCongratulations, you just earned\n...",not spam
2,?? Your GitHub launch code,"Here's your GitHub launch code, @Mortyj420!\n ...",not spam
3,[The Virtual Reward Center] Re: ** Clarifications,"Hello,\n \nThank you for contacting the Virtua...",not spam
4,"10-1 MLB Expert Inside, Plus Everything You Ne...","Hey Prachanda Rawal,\n\nToday's newsletter is ...",spam


- Here is a sample email:

"Hi Walid,

Do you listen to music on Spotify, YouTube, Amazon or Apple?

If you do - you qualify!

You could be making $50 for every song you stream...

All it takes is 3 steps...

Step 1: Create Your Account
Create your account here

Step 2: Pick Your Favourite Artist
Select from thousands of artists and vibe to the music

Step 3: Get Paid
That's it, for every song you stream...

=> Click here right now to start instantly

Regards,

Alex

---
?? Connect with us on Telegram: https://t.me/moneymakingcentral"

## <div align ="center"> Preprocessing techniques </div>

<div align ="center">
    <img src="https://github.com/sondosaabed/Data-Analyst-Nanodegree/assets/65151701/47bf60bf-3078-47c1-bb58-2f72b9c9a9f2" height="200">
</div>

- Tokenization
- Stop word removal
- Stemming
- Rare word removal


### Motivation
- Reduce features
- Cleaner, more representative datasets
<hr/>

### 1. **Tokenization**
- Tokens (words and punctuation marks) are extracted from the text
- The examples here use the `basic_english` tokenizer from torchtext

In [26]:
from torchtext.data.utils import get_tokenizer

tokenizer = get_tokenizer("basic_english")
tokens = tokenizer("I am reading a book now. I love to read books!")

print(tokens)

['i', 'am', 'reading', 'a', 'book', 'now', '.', 'i', 'love', 'to', 'read', 'books', '!']
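If torchtext is not available in your environment, the same idea can be sketched with the standard library. The `simple_tokenize` helper below is a hypothetical stand-in, not torchtext's implementation: it lowercases the text and splits it into runs of word characters and single punctuation marks. It reproduces the output above for this sentence, but it will differ from `basic_english` on other inputs (for example, hashtags and URLs).

```python
import re

def simple_tokenize(text):
    """Hypothetical regex tokenizer: lowercase, then split into words and punctuation."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches any single character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = simple_tokenize("I am reading a book now. I love to read books!")
print(tokens)
```

A function like this can be passed to `df['text'].apply(...)` in the same way as the torchtext tokenizer.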


In [27]:
df['text_tokens'] = df['text'].apply(tokenizer)
df.head()

Unnamed: 0,title,text,type,text_tokens
0,?? the secrets to SUCCESS,"Hi James,\n\nHave you claim your complimentary...",spam,"[hi, james, ,, have, you, claim, your, complim..."
1,?? You Earned 500 GCLoot Points,"\nalt_text\nCongratulations, you just earned\n...",not spam,"[alt_text, congratulations, ,, you, just, earn..."
2,?? Your GitHub launch code,"Here's your GitHub launch code, @Mortyj420!\n ...",not spam,"[here, ', s, your, github, launch, code, ,, @m..."
3,[The Virtual Reward Center] Re: ** Clarifications,"Hello,\n \nThank you for contacting the Virtua...",not spam,"[hello, ,, thank, you, for, contacting, the, v..."
4,"10-1 MLB Expert Inside, Plus Everything You Ne...","Hey Prachanda Rawal,\n\nToday's newsletter is ...",spam,"[hey, prachanda, rawal, ,, today, ', s, newsle..."


In [28]:
df['text_tokens'].sample().item()

['model',
 'casting',
 'call',
 'thank',
 'you',
 'for',
 'taking',
 'the',
 'time',
 'to',
 'register',
 'for',
 'the',
 'anambra',
 'fashion',
 'expo',
 '2023',
 'model',
 'call',
 '.',
 'we',
 'are',
 'thrilled',
 'to',
 'have',
 'received',
 'your',
 'information',
 'and',
 'are',
 'excited',
 'to',
 'review',
 'your',
 'submission',
 '.',
 'our',
 'team',
 'will',
 'be',
 'carefully',
 'reviewing',
 'all',
 'of',
 'the',
 'applications',
 'we',
 'receive',
 'over',
 'the',
 'next',
 'few',
 'weeks',
 'have',
 'you',
 'followed',
 'us',
 'on',
 'our',
 'social',
 'media',
 'handles',
 '?',
 'remember',
 ',',
 'one',
 'of',
 'the',
 'prerequisites',
 'for',
 'qualification',
 'is',
 'to',
 'follow',
 'all',
 'our',
 'social',
 'media',
 'accounts',
 'and',
 'share',
 'all',
 'our',
 'content',
 'using',
 'the',
 'hashtag',
 '#afe2023',
 'you',
 'can',
 'follow',
 'us',
 'on',
 'facebook',
 ',',
 'instagram',
 ',',
 'and',
 'twitter',
 '.',
 'facebook',
 'https',
 '//facebook',
 '.',

<hr/>

### **2. Stop word removal**
- Eliminate common words that do not contribute to the meaning
- Stop words: "a", "the", "and", "or", and more

In [29]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
filtered_tokens = [token for token in tokens if token.lower() not in stop_words]

print(filtered_tokens)

['reading', 'book', '.', 'love', 'read', 'books', '!']


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\SS\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [30]:
def remove_stopwords(tokens):
    return [token for token in tokens if token.lower() not in stop_words]

In [32]:
df['remove_stopwords'] = df['text_tokens'].apply(remove_stopwords)
df.head()

Unnamed: 0,title,text,type,text_tokens,remove_stopwords
0,?? the secrets to SUCCESS,"Hi James,\n\nHave you claim your complimentary...",spam,"[hi, james, ,, have, you, claim, your, complim...","[hi, james, ,, claim, complimentary, gift, yet..."
1,?? You Earned 500 GCLoot Points,"\nalt_text\nCongratulations, you just earned\n...",not spam,"[alt_text, congratulations, ,, you, just, earn...","[alt_text, congratulations, ,, earned, 500, co..."
2,?? Your GitHub launch code,"Here's your GitHub launch code, @Mortyj420!\n ...",not spam,"[here, ', s, your, github, launch, code, ,, @m...","[', github, launch, code, ,, @mortyj420, !, oc..."
3,[The Virtual Reward Center] Re: ** Clarifications,"Hello,\n \nThank you for contacting the Virtua...",not spam,"[hello, ,, thank, you, for, contacting, the, v...","[hello, ,, thank, contacting, virtual, reward,..."
4,"10-1 MLB Expert Inside, Plus Everything You Ne...","Hey Prachanda Rawal,\n\nToday's newsletter is ...",spam,"[hey, prachanda, rawal, ,, today, ', s, newsle...","[hey, prachanda, rawal, ,, today, ', newslette..."


In [33]:
df['remove_stopwords'].sample().item()

['bolt',
 'rides',
 'affordable',
 ',',
 'also',
 'come',
 'number',
 'safety',
 'features',
 '.',
 'things',
 'ensure',
 'safety',
 'train',
 'drivers',
 'verify',
 'identity',
 'app',
 '‘share',
 'eta’',
 'emergency',
 'assist',
 'buttons',
 'use',
 'rider',
 'feedback',
 'improve',
 'future',
 'journeys',
 '.',
 'learn',
 'bolt’s',
 'safety',
 'measures',
 '.',
 'open',
 'app',
 'safe',
 'travels',
 '!',
 '?',
 '?',
 'bolt',
 'team',
 'n',
 '.',
 'b',
 '.',
 'limited-time',
 'offer',
 'use',
 '2023-07-18',
 '2023-07-24',
 'nairobi',
 ',',
 'kenya',
 '.',
 'information',
 'app',
 '.',
 'discount',
 'valid',
 '250',
 'kes',
 'per',
 'trip',
 '.']
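The sample above still contains punctuation-only tokens such as `','`, `'.'`, and `'?'`. As an optional extension, they can be dropped in the same pass as the stop words. The sketch below is dependency-free and makes two assumptions: `STOP_WORDS` is a deliberately tiny, hypothetical list used only for illustration (in practice you would pass the NLTK set built earlier), and "punctuation" means the ASCII characters in `string.punctuation`.

```python
import string

# Hypothetical, deliberately tiny stop word list for illustration only.
STOP_WORDS = {"i", "am", "a", "to", "the", "and", "or", "now"}

def remove_stopwords_and_punct(tokens, stop_words=STOP_WORDS):
    """Drop stop words and tokens made up entirely of ASCII punctuation."""
    return [
        token for token in tokens
        if token.lower() not in stop_words
        and not all(ch in string.punctuation for ch in token)
    ]

tokens = ['i', 'am', 'reading', 'a', 'book', 'now', '.', 'i', 'love', 'to', 'read', 'books', '!']
cleaned = remove_stopwords_and_punct(tokens)
print(cleaned)
```

Whether to drop punctuation depends on the task: for spam detection, marks such as `!` or `$` may carry signal, so treat this as a choice to evaluate rather than a default.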

<hr/>

### **3. Stemming**
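- Reducing words to a base (stem) form by stripping suffixes; the stem is not always a dictionary word
- For example: "running" and "runs" both become "run"; an irregular form such as "ran" is left unchanged by a suffix-stripping stemmer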

In [34]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
filtered_tokens = ["reading", "book", ".", "love", "read", "books", "!"]
stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]

print(stemmed_tokens)

['read', 'book', '.', 'love', 'read', 'book', '!']
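To make the idea of suffix stripping concrete, here is a toy stemmer written from scratch. It is a heavily simplified illustration, not the Porter algorithm: `toy_stem` is a hypothetical helper that handles only three suffixes and one consonant-doubling rule, whereas Porter applies many more rules.

```python
def toy_stem(word):
    """Toy suffix stripper (NOT the Porter algorithm): handles -ing, -ed, -s only."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word

words = ["reading", "books", "running", "runs", "ran", "love"]
stems = [toy_stem(w) for w in words]
print(stems)
```

Note that "ran" passes through unchanged: mapping irregular forms to a base word needs a dictionary-based approach (lemmatization) rather than suffix rules.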


In [35]:
def stemming(filtered_tokens):
    return [stemmer.stem(token) for token in filtered_tokens]

In [36]:
df['steemed_tokens'] = df['remove_stopwords'].apply(stemming)
df.head()

Unnamed: 0,title,text,type,text_tokens,remove_stopwords,steemed_tokens
0,?? the secrets to SUCCESS,"Hi James,\n\nHave you claim your complimentary...",spam,"[hi, james, ,, have, you, claim, your, complim...","[hi, james, ,, claim, complimentary, gift, yet...","[hi, jame, ,, claim, complimentari, gift, yet,..."
1,?? You Earned 500 GCLoot Points,"\nalt_text\nCongratulations, you just earned\n...",not spam,"[alt_text, congratulations, ,, you, just, earn...","[alt_text, congratulations, ,, earned, 500, co...","[alt_text, congratul, ,, earn, 500, complet, f..."
2,?? Your GitHub launch code,"Here's your GitHub launch code, @Mortyj420!\n ...",not spam,"[here, ', s, your, github, launch, code, ,, @m...","[', github, launch, code, ,, @mortyj420, !, oc...","[', github, launch, code, ,, @mortyj420, !, oc..."
3,[The Virtual Reward Center] Re: ** Clarifications,"Hello,\n \nThank you for contacting the Virtua...",not spam,"[hello, ,, thank, you, for, contacting, the, v...","[hello, ,, thank, contacting, virtual, reward,...","[hello, ,, thank, contact, virtual, reward, ce..."
4,"10-1 MLB Expert Inside, Plus Everything You Ne...","Hey Prachanda Rawal,\n\nToday's newsletter is ...",spam,"[hey, prachanda, rawal, ,, today, ', s, newsle...","[hey, prachanda, rawal, ,, today, ', newslette...","[hey, prachanda, rawal, ,, today, ', newslett,..."


In [37]:
df['steemed_tokens'].sample().item()

['sale',
 'execut',
 'dear',
 'hire',
 'profession',
 'contact',
 'express',
 'interest',
 'sale',
 'execut',
 'posit',
 '.',
 'review',
 'posit',
 'requir',
 ',',
 'believ',
 'educ',
 'experi',
 'great',
 'match',
 'posit',
 '.',
 'energet',
 'decis',
 'negoti',
 'skill',
 'goal-set',
 ',',
 'sale',
 'recommend',
 'product',
 '.',
 'natur',
 'talent',
 'build',
 'immedi',
 'rapport',
 'peopl',
 'cultiv',
 'product',
 'connect',
 '.',
 'likewis',
 ',',
 'fulli',
 'capabl',
 'work',
 'strong',
 'person',
 'navig',
 'high-pressur',
 'situat',
 '.',
 'detail',
 ',',
 'pleas',
 'review',
 'attach',
 'resum',
 '.',
 'believ',
 'sale',
 'execut',
 "'",
 'look',
 'welcom',
 'opportun',
 'speak',
 'earliest',
 'conveni',
 '.',
 'sincer',
 ',',
 'zandi',
 'taman']

<hr/>

### **4. Rare word removal**
- Removing infrequent words that don't add value

![28e3fa6560d2ac80296183f5cea80447-1815626993](https://github.com/sondosaabed/Data-Analyst-Nanodegree/assets/65151701/b5c58500-a539-4042-ba89-4b8db816e359)

In [15]:
from nltk.probability import FreqDist

stemmed_tokens = ["read", "book", ".", "love", "read", "book", "!"]
freq_dist = FreqDist(stemmed_tokens)
threshold = 1

common_tokens = [token for token in stemmed_tokens if freq_dist[token] > threshold]
print(common_tokens)

['read', 'book', 'read', 'book']


In [16]:
def remove_rare(stemmed_tokens):
    freq_dist = FreqDist(stemmed_tokens)
    return [token for token in stemmed_tokens if freq_dist[token] > 1]

In [38]:
df['rare_words_removed'] = df['steemed_tokens'].apply(remove_rare)
df.head()

Unnamed: 0,title,text,type,text_tokens,remove_stopwords,steemed_tokens,rare_words_removed
0,?? the secrets to SUCCESS,"Hi James,\n\nHave you claim your complimentary...",spam,"[hi, james, ,, have, you, claim, your, complim...","[hi, james, ,, claim, complimentary, gift, yet...","[hi, jame, ,, claim, complimentari, gift, yet,...","[,, claim, gift, ?, gift, ?, ., >>, claim, >>,..."
1,?? You Earned 500 GCLoot Points,"\nalt_text\nCongratulations, you just earned\n...",not spam,"[alt_text, congratulations, ,, you, just, earn...","[alt_text, congratulations, ,, earned, 500, co...","[alt_text, congratul, ,, earn, 500, complet, f...","[,, earn, point, earn, point, ,, ,, ,, hong, k..."
2,?? Your GitHub launch code,"Here's your GitHub launch code, @Mortyj420!\n ...",not spam,"[here, ', s, your, github, launch, code, ,, @m...","[', github, launch, code, ,, @mortyj420, !, oc...","[', github, launch, code, ,, @mortyj420, !, oc...","[github, code, github, code, github]"
3,[The Virtual Reward Center] Re: ** Clarifications,"Hello,\n \nThank you for contacting the Virtua...",not spam,"[hello, ,, thank, you, for, contacting, the, v...","[hello, ,, thank, contacting, virtual, reward,...","[hello, ,, thank, contact, virtual, reward, ce...","[,, thank, contact, virtual, reward, center, ...."
4,"10-1 MLB Expert Inside, Plus Everything You Ne...","Hey Prachanda Rawal,\n\nToday's newsletter is ...",spam,"[hey, prachanda, rawal, ,, today, ', s, newsle...","[hey, prachanda, rawal, ,, today, ', newslette...","[hey, prachanda, rawal, ,, today, ', newslett,...","[,, today, ', day, ,, insid, play, ,, video, p..."


- This is what the final preprocessed text looks like:

In [39]:
processed_text = df['rare_words_removed']
processed_text.sample(10)

59    [netflix, ,, ,, find, inform, request, netflix...
81    [vladis163ru, steam, guard, code, login, accou...
38                                            [., ., .]
44    [notic, login, ,, alexxuzi, notic, login, devi...
9     [,, interest, join, appen, !, email, invit, ne...
52    [,, team, repli, ., repli, ., ., ., team, ,, p...
23                                                   []
74                                  [., ., ., ==>, ==>]
68    [,, 2023, ,, make, chang, googl, play, term, s...
62    [,, paypal, $8, ,, 32, usd, ., transact, trans...
Name: rare_words_removed, dtype: object
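Several rows above are empty or contain only punctuation. That happens because `remove_rare` counts frequencies inside each email separately, so any word that appears once in a short email is dropped. A common alternative is to count frequencies across the whole corpus. The sketch below assumes that variant; `remove_rare_corpus` and its `min_count` parameter are hypothetical names, and it uses `collections.Counter` in place of NLTK's `FreqDist` to stay dependency-free.

```python
from collections import Counter

def remove_rare_corpus(docs, min_count=2):
    """Keep only tokens that occur at least `min_count` times across all documents."""
    # Count every token over the whole corpus, not per document.
    counts = Counter(token for doc in docs for token in doc)
    return [[token for token in doc if counts[token] >= min_count] for doc in docs]

docs = [
    ["read", "book", "."],
    ["love", "book", "!"],
    ["read", "news", "."],
]
cleaned_docs = remove_rare_corpus(docs)
print(cleaned_docs)
```

With a DataFrame, the same function could be applied to the list of token lists in a column (for example `df['steemed_tokens'].tolist()`) and the result assigned back as a new column.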

<hr/>

## Preprocessing techniques recap
- Tokenization
- Stop word removal
- Stemming
- Rare word removal
- More techniques exist

<hr/>

## <div align ="center"> Encoding techniques </div>

<div align ="center">
    <img src="https://github.com/sondosaabed/Data-Analyst-Nanodegree/assets/65151701/7bdbe523-7e07-442b-87b9-e35e602d49f5" height="120"/>
</div>

### Motivation
- Convert text into machine-readable numbers
- Enable analysis and modeling

<div align ="center">
    <img src="https://github.com/sondosaabed/Data-Analyst-Nanodegree/assets/65151701/ac62b9af-5c9d-4bc5-a643-3df0ee31c394" height="500"/>
</div>


- Allows models to understand and process text
- Choose one technique to avoid redundancy
- More techniques exist

## Encoding Techniques
- One-hot encoding: transforms words into unique numerical representations
- Bag-of-Words (BoW): captures word frequency, disregarding order
- TF-IDF: weights each word by how often it appears in a document and how rare it is across documents
<hr/>

### **1. One-hot encoding**
- Mapping each word to a distinct vector

Binary vector:
- 1 for the presence of a word
- 0 for the absence of a word

['cat', 'dog', 'rabbit']

'cat' [1, 0, 0]

'dog' [0, 1, 0]

'rabbit' [0, 0, 1]

In [15]:
import torch

vocab = ['cat', 'dog', 'rabbit']
vocab_size = len(vocab)

one_hot_vectors = torch.eye(vocab_size)
one_hot_dict = {word: one_hot_vectors[i] for i, word in enumerate(vocab)}

print(one_hot_dict)

{'cat': tensor([1., 0., 0.]), 'dog': tensor([0., 1., 0.]), 'rabbit': tensor([0., 0., 1.])}


In [40]:
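def one_hot_encoding(row):
    # Build the vocabulary from the unique tokens in this row.
    # Note: set order is arbitrary, so the word-to-vector mapping is not stable.
    vocab = set(row)
    vocab_size = len(vocab)
    # Each row of the identity matrix is the one-hot vector of one vocabulary word.
    one_hot_vectors = torch.eye(vocab_size)
    # Return one vector per unique word (not one per token in the row).
    return [one_hot_vectors[i] for i in range(vocab_size)]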

In [42]:
df['ohe'] = df['rare_words_removed'].apply(one_hot_encoding)
df['ohe'].head()

0    [[tensor(1.), tensor(0.), tensor(0.), tensor(0...
1    [[tensor(1.), tensor(0.), tensor(0.), tensor(0...
2    [[tensor(1.), tensor(0.)], [tensor(0.), tensor...
3    [[tensor(1.), tensor(0.), tensor(0.), tensor(0...
4    [[tensor(1.), tensor(0.), tensor(0.), tensor(0...
Name: ohe, dtype: object

In [45]:
df['ohe'][1]

[tensor([1., 0., 0., 0., 0.]),
 tensor([0., 1., 0., 0., 0.]),
 tensor([0., 0., 1., 0., 0.]),
 tensor([0., 0., 0., 1., 0.]),
 tensor([0., 0., 0., 0., 1.])]
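The function above returns one vector per unique word, so the result is always an identity matrix and does not record which token appears where. A variant that encodes each token in order, against a sorted (and therefore reproducible) vocabulary, might look like the sketch below. `build_vocab` and `one_hot_encode` are hypothetical helpers, and plain Python lists are used instead of tensors to keep the example dependency-free.

```python
def build_vocab(tokens):
    """Map each unique token to an index; sorting makes the mapping reproducible."""
    return {word: index for index, word in enumerate(sorted(set(tokens)))}

def one_hot_encode(tokens, vocab):
    """Return one one-hot vector per token, in the original token order."""
    vectors = []
    for token in tokens:
        vector = [0] * len(vocab)
        vector[vocab[token]] = 1  # set the position assigned to this token
        vectors.append(vector)
    return vectors

tokens = ["github", "code", "github", "code", "github"]
vocab = build_vocab(tokens)
encoded = one_hot_encode(tokens, vocab)
print(vocab)
print(encoded)
```

In practice the vocabulary would usually be built once from the whole training corpus, so that the same word maps to the same position in every email.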

<hr/>

### **2. Bag of words**
- Example: "The cat sat on the mat"
- Bag-of-words: {'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1}

- Treating each document as an unordered collection of words
-  Focuses on frequency, not order
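The "cat sat on the mat" example can be reproduced with a few lines of standard-library Python. This is a minimal sketch that assumes whitespace tokenization and lowercasing; it is only meant to show that a bag of words is a word-to-count mapping.

```python
from collections import Counter

sentence = "The cat sat on the mat"
# Lowercase so "The" and "the" are counted as the same word, then split on whitespace.
bag = Counter(sentence.lower().split())
print(dict(bag))
```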

In [31]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
corpus = ['This is the first document.', 'This document is the second document.', 'And this is the third one.', 'Is this the first document?']
X = vectorizer.fit_transform(corpus)

print(X.toarray())
print(vectorizer.get_feature_names_out())

[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']


<hr/>

### **3. TF-IDF**
Term Frequency-Inverse Document Frequency
- Scores the importance of words in a document
- Words that are rare across documents get a higher score
- Words that appear in most documents get a lower score
- Emphasizes informative words

In [32]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
corpus = ['This is the first document.','This document is the second document.', 'And this is the third one.','Is this the first document?']

X = vectorizer.fit_transform(corpus)
print(X.toarray())
print(vectorizer.get_feature_names_out())

[[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]
 [0.         0.6876236  0.         0.28108867 0.         0.53864762
  0.28108867 0.         0.28108867]
 [0.51184851 0.         0.         0.26710379 0.51184851 0.
  0.26710379 0.51184851 0.26710379]
 [0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]]
['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
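To see where these numbers come from, the scores can be recomputed by hand. The sketch below assumes scikit-learn's default settings for `TfidfVectorizer` (smoothed IDF, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each row) and a tokenizer that lowercases and keeps words of two or more characters. The helper names (`tokenize`, `tfidf_row`) are hypothetical.

```python
import math
import re
from collections import Counter

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

def tokenize(doc):
    """Lowercase and keep words of at least two characters."""
    return re.findall(r"\b\w\w+\b", doc.lower())

docs = [tokenize(doc) for doc in corpus]
vocab = sorted({token for doc in docs for token in doc})
n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(token for doc in docs for token in set(doc))
# Smoothed inverse document frequency.
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}

def tfidf_row(doc):
    """Raw term counts times IDF, scaled so the row has unit L2 norm."""
    term_counts = Counter(doc)
    weights = [term_counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

matrix = [tfidf_row(doc) for doc in docs]
print(vocab)
print([round(value, 8) for value in matrix[0]])
```

In the first document, "first" (present in 2 of 4 documents) scores higher than "document" (3 of 4), which in turn scores higher than "is", "the", and "this" (all 4 documents).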


<hr/>

## Encoding techniques recap
- One-hot encoding
- Bag-of-words
- TF-IDF encoding
- More techniques exist