## <div align="center"> Text processing pipeline </div>
Clean and prepare text data for classification and other downstream tasks.

<hr/>

## <div align ="center">  Pipeline for handling text datasets </div>

<div align ="center"> 
    <img src="https://github.com/sondosaabed/Data-Analyst-Nanodegree/assets/65151701/b7baeb0c-92dc-4cd1-b99a-8dbe9bb7f9d0" height="200">
</div>

<hr/>

### Motivation
- Reduce the number of features
- Produce cleaner, more representative datasets

### Use cases
- Sentiment analysis
- Text summarization
- Machine translation

<hr/>

## <div align ="center"> Tools </div>

<div align ="center">
    <img src="https://github.com/sondosaabed/Data-Analyst-Nanodegree/assets/65151701/c15b4b04-fc4c-4d09-97e9-1ff0d1697186" height="200"/>
</div>

<hr/>

### Example Dataset: Ham or Spam

In [62]:
import warnings
import pandas as pd
warnings.filterwarnings("ignore")

- Read the data from the source file:

In [4]:
df = pd.read_csv("./email_spam.csv")
df.head()

Unnamed: 0,title,text,type
0,?? the secrets to SUCCESS,"Hi James,\n\nHave you claim your complimentary...",spam
1,?? You Earned 500 GCLoot Points,"\nalt_text\nCongratulations, you just earned\n...",not spam
2,?? Your GitHub launch code,"Here's your GitHub launch code, @Mortyj420!\n ...",not spam
3,[The Virtual Reward Center] Re: ** Clarifications,"Hello,\n \nThank you for contacting the Virtua...",not spam
4,"10-1 MLB Expert Inside, Plus Everything You Ne...","Hey Prachanda Rawal,\n\nToday's newsletter is ...",spam


- Here is a sample email from the dataset:

"Hi Walid,

Do you listen to music on Spotify, YouTube, Amazon or Apple?

If you do - you qualify!

You could be making $50 for every song you stream...

All it takes is 3 steps...

Step 1: Create Your Account
Create your account here

Step 2: Pick Your Favourite Artist
Select from thousands of artists and vibe to the music

Step 3: Get Paid
That's it, for every song you stream...

=> Click here right now to start instantly

Regards,

Alex

---
?? Connect with us on Telegram: https://t.me/moneymakingcentral"

## <div align ="center"> Preprocessing techniques </div>

<div align ="center">
    <img src="https://github.com/sondosaabed/Data-Analyst-Nanodegree/assets/65151701/47bf60bf-3078-47c1-bb58-2f72b9c9a9f2" height="200">
</div>

- Tokenization
- Stop word removal
- Stemming
- Rare word removal

<hr/>

### **1. Tokenization**
- Split text into tokens (words and punctuation marks)
- Here, tokenization uses torchtext's "basic_english" tokenizer, which also lowercases the text

In [29]:
from torchtext.data.utils import get_tokenizer

tokenizer = get_tokenizer("basic_english")
tokens = tokenizer("I am reading a book now. I love to read books!")

print(tokens)

['i', 'am', 'reading', 'a', 'book', 'now', '.', 'i', 'love', 'to', 'read', 'books', '!']


In [30]:
df['text_tokens'] = df['text'].apply(tokenizer)
df.head()

Unnamed: 0,title,text,type,text_tokens
0,?? the secrets to SUCCESS,"Hi James,\n\nHave you claim your complimentary...",spam,"[hi, james, ,, have, you, claim, your, complim..."
1,?? You Earned 500 GCLoot Points,"\nalt_text\nCongratulations, you just earned\n...",not spam,"[alt_text, congratulations, ,, you, just, earn..."
2,?? Your GitHub launch code,"Here's your GitHub launch code, @Mortyj420!\n ...",not spam,"[here, ', s, your, github, launch, code, ,, @m..."
3,[The Virtual Reward Center] Re: ** Clarifications,"Hello,\n \nThank you for contacting the Virtua...",not spam,"[hello, ,, thank, you, for, contacting, the, v..."
4,"10-1 MLB Expert Inside, Plus Everything You Ne...","Hey Prachanda Rawal,\n\nToday's newsletter is ...",spam,"[hey, prachanda, rawal, ,, today, ', s, newsle..."
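
If torchtext is not installed, the behaviour shown above can be approximated with the standard library. The sketch below is a stand-in, not torchtext's implementation: `simple_tokenize` is a hypothetical helper that lowercases the text and splits punctuation into separate tokens. It reproduces the output for the example sentence, but it may differ from `basic_english` on other input (for instance, it splits `@mortyj420` into `@` and `mortyj420`).

```python
import re
from typing import List

# Hypothetical stand-in for torchtext's "basic_english" tokenizer (an assumption,
# not the library's code). A token is either a run of word characters
# (letters, digits, underscore) or a single punctuation character.
_TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word and punctuation tokens."""
    return _TOKEN_PATTERN.findall(text.lower())


tokens = simple_tokenize("I am reading a book now. I love to read books!")
print(tokens)
# ['i', 'am', 'reading', 'a', 'book', 'now', '.', 'i', 'love', 'to', 'read', 'books', '!']
```

Because the function takes a string and returns a list, it could be passed to `df['text'].apply(...)` in the same way as `tokenizer`.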


### **2. Stop word removal**
- Eliminate common words that do not contribute to the meaning
- Stop words: "a", "the", "and", "or", and more

In [32]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
filtered_tokens = [token for token in tokens if token.lower() not in stop_words]

print(filtered_tokens)

['reading', 'book', '.', 'love', 'read', 'books', '!']


In [35]:
def remove_stopwords(tokens):
    return [token for token in tokens if token.lower() not in stop_words]

In [38]:
df['remove_stopwords'] = df['text_tokens'].apply(remove_stopwords)
df.head()

Unnamed: 0,title,text,type,text_tokens,remove_stopwords
0,?? the secrets to SUCCESS,"Hi James,\n\nHave you claim your complimentary...",spam,"[hi, james, ,, have, you, claim, your, complim...","[hi, james, ,, claim, complimentary, gift, yet..."
1,?? You Earned 500 GCLoot Points,"\nalt_text\nCongratulations, you just earned\n...",not spam,"[alt_text, congratulations, ,, you, just, earn...","[alt_text, congratulations, ,, earned, 500, co..."
2,?? Your GitHub launch code,"Here's your GitHub launch code, @Mortyj420!\n ...",not spam,"[here, ', s, your, github, launch, code, ,, @m...","[', github, launch, code, ,, @mortyj420, !, oc..."
3,[The Virtual Reward Center] Re: ** Clarifications,"Hello,\n \nThank you for contacting the Virtua...",not spam,"[hello, ,, thank, you, for, contacting, the, v...","[hello, ,, thank, contacting, virtual, reward,..."
4,"10-1 MLB Expert Inside, Plus Everything You Ne...","Hey Prachanda Rawal,\n\nToday's newsletter is ...",spam,"[hey, prachanda, rawal, ,, today, ', s, newsle...","[hey, prachanda, rawal, ,, today, ', newslette..."
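
The filtered lists above still contain punctuation tokens such as `,` and `'`. As a sketch, assuming punctuation carries no signal for the task at hand, the same filter can drop those tokens as well. `ILLUSTRATIVE_STOP_WORDS` is a small hand-written set that only keeps the example self-contained; in the notebook, the NLTK list in `stop_words` would take its place. `remove_stopwords_and_punct` is a hypothetical name.

```python
import string
from typing import Iterable, List, Set

# Hand-written for illustration only; much smaller than NLTK's English list.
ILLUSTRATIVE_STOP_WORDS: Set[str] = {"a", "an", "the", "and", "or", "i", "am", "to", "now"}


def remove_stopwords_and_punct(
    tokens: Iterable[str],
    stop_words: Set[str],
    drop_punctuation: bool = True,
) -> List[str]:
    """Drop stop words and, optionally, tokens made only of ASCII punctuation."""
    kept = []
    for token in tokens:
        if token.lower() in stop_words:
            continue
        # string.punctuation covers ASCII marks only (not curly quotes, for example).
        if drop_punctuation and all(ch in string.punctuation for ch in token):
            continue
        kept.append(token)
    return kept


tokens = ['i', 'am', 'reading', 'a', 'book', 'now', '.', 'i', 'love', 'to', 'read', 'books', '!']
print(remove_stopwords_and_punct(tokens, ILLUSTRATIVE_STOP_WORDS))
# ['reading', 'book', 'love', 'read', 'books']
```

With `drop_punctuation=False` the function behaves like `remove_stopwords` and keeps `.` and `!`.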


### **3. Stemming**
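- Reduce words to their base form (stem) by stripping suffixes
- For example, "running" and "runs" both become "run"; an irregular form such as "ran" is not changed by the Porter stemmer used here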

In [39]:
import nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
filtered_tokens = ["reading", "book", ".", "love", "read", "books", "!"]
stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]

print(stemmed_tokens)

['read', 'book', '.', 'love', 'read', 'book', '!']


In [40]:
def stemming(filtered_tokens):
    return [stemmer.stem(token) for token in filtered_tokens]

In [42]:
df['steemed_tokens'] = df['remove_stopwords'].apply(stemming)
df.head()

Unnamed: 0,title,text,type,text_tokens,remove_stopwords,steemed_tokens
0,?? the secrets to SUCCESS,"Hi James,\n\nHave you claim your complimentary...",spam,"[hi, james, ,, have, you, claim, your, complim...","[hi, james, ,, claim, complimentary, gift, yet...","[hi, jame, ,, claim, complimentari, gift, yet,..."
1,?? You Earned 500 GCLoot Points,"\nalt_text\nCongratulations, you just earned\n...",not spam,"[alt_text, congratulations, ,, you, just, earn...","[alt_text, congratulations, ,, earned, 500, co...","[alt_text, congratul, ,, earn, 500, complet, f..."
2,?? Your GitHub launch code,"Here's your GitHub launch code, @Mortyj420!\n ...",not spam,"[here, ', s, your, github, launch, code, ,, @m...","[', github, launch, code, ,, @mortyj420, !, oc...","[', github, launch, code, ,, @mortyj420, !, oc..."
3,[The Virtual Reward Center] Re: ** Clarifications,"Hello,\n \nThank you for contacting the Virtua...",not spam,"[hello, ,, thank, you, for, contacting, the, v...","[hello, ,, thank, contacting, virtual, reward,...","[hello, ,, thank, contact, virtual, reward, ce..."
4,"10-1 MLB Expert Inside, Plus Everything You Ne...","Hey Prachanda Rawal,\n\nToday's newsletter is ...",spam,"[hey, prachanda, rawal, ,, today, ', s, newsle...","[hey, prachanda, rawal, ,, today, ', newslette...","[hey, prachanda, rawal, ,, today, ', newslett,..."
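
To make the idea of suffix stripping concrete, here is a deliberately tiny stemmer. It is a toy written for illustration, not the Porter algorithm that `PorterStemmer` implements, and `toy_stem` is a hypothetical name. It shows why regular forms such as "running" and "runs" collapse to one stem, while an irregular form such as "ran" is left alone by rules that only look at suffixes.

```python
def toy_stem(token: str) -> str:
    """Strip a few common English suffixes. A toy, NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Require at least three characters to remain after stripping.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            stem = token[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run" (but keep "ll", "ss", "zz").
            if stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return token


print([toy_stem(t) for t in ["running", "runs", "ran"]])
# ['run', 'run', 'ran']
print([toy_stem(t) for t in ["reading", "book", ".", "love", "read", "books", "!"]])
# ['read', 'book', '.', 'love', 'read', 'book', '!']
```

Real stemmers apply many more rules and exceptions, so this sketch is for building intuition only.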


### **4. Rare word removal**
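- Remove infrequent words that add little value
- Here, a token counts as rare if it appears only once in the text being processed (frequencies are computed per email, not across the dataset)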

In [52]:
from nltk.probability import FreqDist

stemmed_tokens = ["read", "book", ".", "love", "read", "book", "!"]
freq_dist = FreqDist(stemmed_tokens)
threshold = 1

common_tokens = [token for token in stemmed_tokens if freq_dist[token] > threshold]
print(common_tokens)

['read', 'book', 'read', 'book']


In [58]:
def remove_rare(stemmed_tokens, threshold=1):
    # Keep only tokens that occur more than `threshold` times in this token list.
    freq_dist = FreqDist(stemmed_tokens)
    return [token for token in stemmed_tokens if freq_dist[token] > threshold]

In [59]:
df['rare_words_removed'] = df['steemed_tokens'].apply(remove_rare)
df.tail()

Unnamed: 0,title,text,type,text_tokens,remove_stopwords,steemed_tokens,rare_words_removed
79,Your application for the position of Child Pr...,"Dear Maryam, \n\n \n\nI would like to thank yo...",not spam,"[dear, maryam, ,, i, would, like, to, thank, y...","[dear, maryam, ,, would, like, thank, applicat...","[dear, maryam, ,, would, like, thank, applic, ...","[,, applic, ,, ., applic, ., ,, .]"
80,Your Kilimall Account is Ready - Shopping Now!,"Dear Customer,\n\nWelcome to Kilimall, Thanks ...",not spam,"[dear, customer, ,, welcome, to, kilimall, ,, ...","[dear, customer, ,, welcome, kilimall, ,, than...","[dear, custom, ,, welcom, kilimal, ,, thank, m...","[custom, ,, kilimal, ,, much, ., kilimal, afri..."
81,Your Steam account: Access from new web or mob...,"Dear vladis163rus,\nHere is the Steam Guard co...",not spam,"[dear, vladis163rus, ,, here, is, the, steam, ...","[dear, vladis163rus, ,, steam, guard, code, ne...","[dear, vladis163ru, ,, steam, guard, code, nee...","[vladis163ru, steam, guard, code, login, accou..."
82,Your uploaded document is rejected,View In Browser | Log in\n \n \n\nSkrill logo\...,not spam,"[view, in, browser, |, log, in, skrill, logo, ...","[view, browser, |, log, skrill, logo, money, m...","[view, browser, |, log, skrill, logo, money, m...","[|, skrill, money, ?, ?, couldn’t, verifi, add..."
83,You've Earned a Reward from Bard Explorers India,You've received a gift!\nSign in to your Bard ...,not spam,"[you, ', ve, received, a, gift, !, sign, in, t...","[', received, gift, !, sign, bard, explorers, ...","[', receiv, gift, !, sign, bard, explor, india...","[gift, !, bard, explor, india, commun, member,..."
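
`remove_rare` counts frequencies inside each email, so a word is kept only if it repeats within the same message. An alternative, sketched below under the assumption that rarity should be judged across the whole dataset, counts tokens over all documents first. `remove_rare_corpus` and `min_count` are hypothetical names, and the sketch uses `collections.Counter` instead of `FreqDist` to stay dependency-free.

```python
from collections import Counter
from typing import List


def remove_rare_corpus(documents: List[List[str]], min_count: int = 2) -> List[List[str]]:
    """Keep tokens that occur at least `min_count` times across all documents."""
    corpus_counts = Counter(token for doc in documents for token in doc)
    return [[t for t in doc if corpus_counts[t] >= min_count] for doc in documents]


docs = [["read", "book", "love"], ["read", "paper"], ["love", "music"]]
print(remove_rare_corpus(docs))
# [['read', 'love'], ['read'], ['love']]
```

In the notebook this could be applied to the whole column at once, for example by passing `df['steemed_tokens'].tolist()` and assigning the result to a new column.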


- This is what the final preprocessed text looks like:

In [61]:
processed_text = df['rare_words_removed']
processed_text.sample(20)

35    [bolt, ,, safeti, ., safeti, app, use, ., safe...
52    [,, team, repli, ., repli, ., ., ., team, ,, p...
48    [', ', ., ,, ,, ,, ,, ,, ,, ,, variou, cytonn,...
37    [admin, assist, admin, assist, opportun, compa...
27    [', year, ', ., win, project, year, ?, win, ?,...
30            [singl, look, ?, ?, ', ', look, ', singl]
22                                                   []
75    [!, applic, process, job, -, need, (, ), (, jo...
76                   [notic, login, ., notic, login, .]
19    [amazon, prime, free, trial, cancel, ., amazon...
17    [pleas, ,, scholarship, allow, degre, ,, ., jo...
80    [custom, ,, kilimal, ,, much, ., kilimal, afri...
24    [respond, feedback, feedback, us, ., ', ,, us,...
2                  [github, code, github, code, github]
18    [jobstreet, ., com, ,, job, ., jobstreet, ., c...
61    [,, file, ., today, ,, file, return, ., file, ...
41    [,, zoom, call, -, ,, (, ), ., regist, free, s...
64                 [tv, ., ', tv, ', tv, ., ., .

<hr/>

## Preprocessing techniques recap
- Tokenization
- Stop word removal
- Stemming
- Rare word removal
- More techniques exist
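
The four steps can be chained into one function. The sketch below assumes the tokenizer, stop word set, and stemmer are passed in as arguments, so it runs without torchtext or NLTK. `preprocess` and `min_count` are hypothetical names, and the stand-ins in the usage lines are placeholders for `tokenizer`, `stop_words`, and `stemmer.stem` from the notebook.

```python
from collections import Counter
from typing import Callable, List, Set


def preprocess(
    text: str,
    tokenize: Callable[[str], List[str]],
    stop_words: Set[str],
    stem: Callable[[str], str],
    min_count: int = 2,
) -> List[str]:
    """Tokenize, drop stop words, stem, then drop tokens rarer than `min_count`."""
    tokens = tokenize(text)
    tokens = [t for t in tokens if t.lower() not in stop_words]
    tokens = [stem(t) for t in tokens]
    counts = Counter(tokens)  # frequencies within this one text
    return [t for t in tokens if counts[t] >= min_count]


# Placeholder components: whitespace splitting and an identity "stemmer".
result = preprocess(
    "the cat and the cat sat",
    tokenize=str.split,
    stop_words={"the", "and"},
    stem=lambda t: t,
)
print(result)
# ['cat', 'cat']
```

Passing the components as arguments keeps the function easy to test and makes it simple to swap one step without touching the others.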

<hr/>

## <div align ="center"> Encoding techniques </div>

<div align ="center">
    <img src="https://github.com/sondosaabed/Data-Analyst-Nanodegree/assets/65151701/7bdbe523-7e07-442b-87b9-e35e602d49f5" height="120"/>
</div>

### Motivation
- Convert text into machine-readable numbers
- Enable analysis and modeling

<div align ="center">
    <img src="https://github.com/sondosaabed/Data-Analyst-Nanodegree/assets/65151701/ac62b9af-5c9d-4bc5-a643-3df0ee31c394" height="500"/>
</div>
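
As one concrete illustration of turning tokens into numbers, the sketch below builds bag-of-words count vectors with the standard library. The choice of bag-of-words is an assumption made for this example. `build_vocab` and `bag_of_words` are hypothetical helpers, and the two token lists are taken from the preprocessed sample shown earlier.

```python
from typing import Dict, List


def build_vocab(documents: List[List[str]]) -> Dict[str, int]:
    """Assign each distinct token an index in order of first appearance."""
    vocab: Dict[str, int] = {}
    for doc in documents:
        for token in doc:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def bag_of_words(doc: List[str], vocab: Dict[str, int]) -> List[int]:
    """Count how often each vocabulary token occurs in one document."""
    vector = [0] * len(vocab)
    for token in doc:
        index = vocab.get(token)
        if index is not None:  # tokens outside the vocabulary are ignored
            vector[index] += 1
    return vector


docs = [
    ["github", "code", "github", "code", "github"],
    ["notic", "login", ".", "notic", "login", "."],
]
vocab = build_vocab(docs)
vectors = [bag_of_words(doc, vocab) for doc in docs]
print(vocab)    # {'github': 0, 'code': 1, 'notic': 2, 'login': 3, '.': 4}
print(vectors)  # [[3, 2, 0, 0, 0], [0, 0, 2, 2, 2]]
```

Each email becomes a fixed-length list of counts, which is a form that a classifier can consume directly.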