### Data Mining Process (NLP Level 1):

#### Data Preprocessing & Data Normalization (Cleaning and Transforming)

- **_Tokenization (Data Preprocessing):_**
    - Splits a paragraph into sentences and words.
    - Word tokenization: `nltk.word_tokenize(sentence)`
    - Sentence tokenization: `nltk.sent_tokenize(paragraph)`
- **_Stopword Removal (Data Preprocessing):_**
    - Removes common words that carry little meaning (`i`, `me`, `we`, `that`, etc.), using the list provided by `from nltk.corpus import stopwords`.
    - `stop_words = nltk.corpus.stopwords.words('english')`
    - `words = [w for w in words if w not in stop_words]`
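
The tokenization and stopword removal steps above can be sketched without any dependencies. This is only a rough stand-in for `nltk.sent_tokenize`, `nltk.word_tokenize` and the NLTK stopword list: the regular expressions and the tiny `STOP_WORDS` set are assumptions made for illustration, and NLTK handles many more edge cases (abbreviations, punctuation, contractions).

```python
import re

# Hypothetical, deliberately tiny stopword set; NLTK's English list is much longer.
STOP_WORDS = {"i", "me", "we", "that", "the", "is", "a", "to", "and"}


def split_sentences(paragraph):
    """Naive sentence split: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]


def split_words(sentence):
    """Naive word split: lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", sentence.lower())


def remove_stop_words(words):
    """Keep only the words that are not in the stopword set."""
    return [w for w in words if w not in STOP_WORDS]


paragraph = "I love Python. We think that the library is great!"
sentences = split_sentences(paragraph)
tokens = [remove_stop_words(split_words(s)) for s in sentences]

print(sentences)  # ['I love Python.', 'We think that the library is great!']
print(tokens)     # [['love', 'python'], ['think', 'library', 'great']]
```
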
- **_Stemming (Data Normalization):_**
    - Removes suffixes from words to obtain a root form (the stem), which may not be a real word.
    - `PorterStemmer()`: very basic and widely used.
    - `SnowballStemmer()`: an extension of the Porter stemmer, available for multiple languages.
    - `LancasterStemmer()`: very aggressive in reducing words to their stems.
- **_Lemmatization (Data Normalization):_**
    - Unlike stemming, lemmatization does not simply strip suffixes; it converts the word into its dictionary form (lemma) with respect to context.
    - Lemmatization results in real words that make linguistic sense.
    - `WordNetLemmatizer()` is a popular lemmatizer.
    - `lemma = nltk.stem.WordNetLemmatizer()`
    - `lemma.lemmatize(word)`
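
To make the difference between stemming and lemmatization concrete, here is a toy sketch. It is not the Porter, Snowball or WordNet algorithm: the suffix list and the small lemma dictionary are hypothetical, chosen only to show that suffix stripping can produce non-words while a dictionary lookup returns real words.

```python
# Hypothetical suffix list and lemma dictionary, for illustration only.
SUFFIXES = ("ing", "ies", "ed", "es", "s")
LEMMAS = {"studies": "study", "studying": "study", "better": "good", "mice": "mouse"}


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def crude_lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMAS.get(word, word)


for w in ["studies", "studying", "better", "mice"]:
    print(f"{w}: stem={crude_stem(w)}, lemma={crude_lemmatize(w)}")
```

With these toy rules, "studies" is stemmed to "stud" (not a word) but lemmatized to "study", and irregular forms such as "mice" are only resolved by the dictionary lookup.
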
- **_POS Tagging:_**
    - `nltk.pos_tag(lemmas)`
    - POS tagging is a crucial and fundamental step in NLP. Sentiment often depends on adjectives and adverbs, so POS tags help us determine the sentiment of a sentence.
- **_BoW [Method]:_**
    - Bag of Words is a method that converts text into vectors using word frequencies. Each document is represented by a vector of word counts, and the vectors are stacked to form a matrix. Each word is a column (feature) in the matrix and is treated independently.
    - Because each word is treated as an independent feature, the method is called Bag of Words. As a result, it does not capture the order of words in a sentence, nor does it handle plurals or synonyms.
    - `CountVectorizer()`: `from sklearn.feature_extraction.text import CountVectorizer`
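
The idea behind Bag of Words can be sketched from scratch as follows. This is a conceptual illustration, not how `CountVectorizer` is implemented; the three example documents and the helper name `to_counts` are made up for the sketch.

```python
from collections import Counter

# Made-up example documents.
docs = ["the movie was good", "the movie was not good", "good good movie"]

tokenized = [d.split() for d in docs]

# One column (feature) per distinct word, in a fixed order.
vocabulary = sorted({w for doc in tokenized for w in doc})


def to_counts(tokens):
    """Turn a list of tokens into a count vector aligned with the vocabulary."""
    counts = Counter(tokens)
    return [counts[w] for w in vocabulary]


matrix = [to_counts(doc) for doc in tokenized]

print(vocabulary)  # ['good', 'movie', 'not', 'the', 'was']
for row in matrix:
    print(row)
# [1, 1, 0, 1, 1]
# [1, 1, 1, 1, 1]
# [2, 1, 0, 0, 0]

# Word order is lost: a shuffled sentence gives the same vector.
print(to_counts("good was movie the".split()) == matrix[0])  # True
```
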
- **_TF-IDF [Method]:_**
    - TF-IDF is a measure of how important a word is to a document in a collection of documents.
    - For example, if "good" is used frequently across all three documents in a corpus, it is not useful for telling them apart, so it gets a lower TF-IDF score.
    - TF-IDF is a kind of extension of the BoW method.
    - `TfidfVectorizer()`: `from sklearn.feature_extraction.text import TfidfVectorizer`
    - `TfidfVectorizer` requires its input as a list of strings, not a list of lists, which means that after lemmatization you have to join the tokens back into strings (e.g. `' '.join(tokens)`).
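
The "good in three documents" case can be worked through with the textbook formula, tf × log(N / df). This is a sketch under that assumption; `TfidfVectorizer` applies a smoothed IDF and normalization by default, so its numbers will differ. The three token lists are made up for the example.

```python
import math
from collections import Counter

# Made-up corpus: "good" appears in all three documents.
docs = [
    ["good", "movie"],
    ["good", "plot", "plot"],
    ["good", "acting"],
]

n_docs = len(docs)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(w for doc in docs for w in set(doc))


def tf_idf(doc):
    """Textbook TF-IDF: (count / doc length) * log(N / document frequency)."""
    counts = Counter(doc)
    return {
        w: (c / len(doc)) * math.log(n_docs / doc_freq[w])
        for w, c in counts.items()
    }


scores = [tf_idf(doc) for doc in docs]
for s in scores:
    print(s)
```

Because "good" occurs in every document, log(3 / 3) is 0 and its score is 0 everywhere, while words that occur in only one document (such as "plot") get the highest scores.
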
- **_Word2vec:_**
    - Word2vec is a word embedding technique used in NLP tasks.
    - Word2vec assigns each word in a given corpus a high-dimensional vector representation. This representation is learned by training a neural network on a large text dataset. The idea is that words with similar meanings will have similar vectors.
    - Word2vec is available in the `gensim` library.
    - `from gensim.models import Word2Vec`
    - `model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)`
    - `sg`: Skip-gram model (1) or Continuous Bag of Words (0).
    - `vector_size`: the dimensionality of the Word2vec word vectors. (Common values are 100 to 300; use a smaller size for a smaller dataset.)
    - `window`: the maximum distance between the current and predicted word within a sentence. (Typically between 2 and 5.)
    - `min_count`: ignores all words with a total frequency lower than this.
    - `sentences` should be a list of lists of tokens, not a flat list (e.g. `[['I', 'love', 'Python']]`).
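
To illustrate what `window` means and why `sentences` must be a list of token lists, the sketch below builds the (center, context) pairs that a skip-gram model (`sg=1`) would be trained on. It is a simplified, hypothetical helper and not gensim's implementation; details such as subsampling and negative sampling are omitted.

```python
def skipgram_pairs(sentences, window=2):
    """Collect (center, context) pairs for words at most `window` positions apart."""
    pairs = []
    for tokens in sentences:  # each sentence is its own list of tokens
        for i, center in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, tokens[j]))
    return pairs


# A list of lists: pairs never cross a sentence boundary.
sentences = [["i", "love", "python"], ["python", "is", "fun"]]
pairs = skipgram_pairs(sentences, window=1)

for pair in pairs:
    print(pair)
```

With `window=1` each word is paired only with its direct neighbours, so "love" is paired with "i" and "python" but never with "is", which belongs to the other sentence.
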

```python
import re

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Download the NLTK resources used below (skipped if already present)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)

# Define the pipeline globally so it can be reused for user input after training
pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', LogisticRegression())
])

# Build these once instead of on every call to preprocess_text
lemma = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))


def preprocess_text(text: str) -> str:
    """
    Preprocesses the text by replacing non-letter characters with spaces,
    converting to lowercase, removing stopwords, and lemmatizing the words.

    Args:
        text (str): The input text to be preprocessed.

    Returns:
        str: The preprocessed text, joined back into a single string.
    """
    words = re.sub('[^a-zA-Z]', ' ', text).lower().split()  # keep letters only, lowercase, tokenize
    words = [lemma.lemmatize(word) for word in words if word not in stop_words]  # drop stopwords, lemmatize
    return ' '.join(words)  # TfidfVectorizer expects strings, not token lists


def train_and_evaluate_model(df: pd.DataFrame) -> float:
    """
    Trains and evaluates a logistic regression model on the given dataframe.

    Args:
        df (pd.DataFrame): The input dataframe with 'Category' and 'Message' columns.

    Returns:
        float: The accuracy score of the model.
    """
    # Encode the target as 0/1; get_dummies returns a DataFrame, so take its single column
    df['Spam'] = pd.get_dummies(df['Category'], drop_first=True).iloc[:, 0].astype(int)
    y = df['Spam']  # target variable
    messages = df['Message']  # feature / input variable
    corpus = [preprocess_text(message) for message in messages]  # preprocess the input text

    # Split the dataset into train and test sets
    x_train, x_test, y_train, y_test = train_test_split(corpus, y, test_size=0.2, random_state=42)
    pipeline.fit(x_train, y_train)
    pred = pipeline.predict(x_test)
    return accuracy_score(y_test, pred)


# Read the dataset
df = pd.read_csv('spam.csv')

# Train and evaluate the model
accuracy = train_and_evaluate_model(df)
print(f"Accuracy Score: {accuracy}")

# Read the user's email as input and classify it
user_input = input("Enter your email: ")
user_email = preprocess_text(user_input)
prediction = pipeline.predict([user_email])
if prediction[0] == 0:
    print("Your email is Not Spam")
else:
    print("Your email is Spam! Hope you don't get it")
```

Output:

```
Accuracy Score: 0.968609865470852
Your email is Spam! Hope you don't get it
```


### Model Using Text Embedding Technique (Word2vec) - ML Model: Decision Tree