# Text Pre-Processing and Feature Engineering

In this notebook, we will explore how to turn raw text into features for text classification: representing documents as vectors, engineering features, and cleaning text before vectorization. Topics covered include:

- **Vector Space Representation**
- **Feature Engineering**
- **Text Pre-processing**
  - Tokenization Issues
  - Stop Words & Normalization
  - Lemmatization & Stemming
- **Real World Issues**

In [None]:
!pip install nltk

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
import string

In [None]:
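# Download only the NLTK resources this notebook needs.
# nltk.download('all') also works, but it fetches every corpus and model.
# Recent NLTK releases look for 'punkt_tab'; older ones use 'punkt'.
for resource in ('punkt', 'punkt_tab', 'stopwords', 'wordnet'):
    nltk.download(resource)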

## Vector Space Representation

Text classification commonly relies on a vector space model, in which each document is represented as a vector in a multi-dimensional space. Each dimension corresponds to one term in the corpus vocabulary, and the value along that dimension reflects how often (or how distinctively) the term occurs in the document. We will explore **CountVectorizer** and **TF-IDF** as methods for feature extraction in text classification.
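
To make the idea concrete, here is a minimal pure-Python sketch of a bag-of-words vector space. The `toy_docs` list and the whitespace tokenizer are illustrative assumptions and are not part of the notebook's data.

```python
from collections import Counter

# Hypothetical two-document corpus, used only for this illustration
toy_docs = ["data mining finds patterns", "data science uses data"]

# Naive tokenization: lowercase and split on whitespace
tokenized = [doc.lower().split() for doc in toy_docs]

# The vocabulary defines the dimensions of the vector space
vocabulary = sorted({term for tokens in tokenized for term in tokens})

# Each document becomes a vector of term counts, one entry per vocabulary term
vectors = [[Counter(tokens)[term] for term in vocabulary] for tokens in tokenized]

print(vocabulary)  # ['data', 'finds', 'mining', 'patterns', 'science', 'uses']
print(vectors)     # [[1, 1, 1, 1, 0, 0], [2, 0, 0, 0, 1, 1]]
```

Each row is one document and each column is one vocabulary term, which is the same layout as the count matrix produced by a vectorizer.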

### Sample text documents for demonstration

In [None]:
# Sample text documents for demonstration
documents = [
    "Data Mining involves discovering patterns in large datasets to extract useful knowledge.",
    "Data Science is a multidisciplinary field that uses scientific methods to extract knowledge from structured and unstructured data.",
    "Machine Learning is a subset of Artificial Intelligence that enables systems to learn from data and improve without explicit programming.",
    "Artificial Intelligence refers to the simulation of human intelligence in machines that are programmed to think and learn."
]


### Increase similarity between documents

In [None]:
# Increase similarity between documents by giving them more shared vocabulary
documents = [
    "Data Mining involves discovering patterns in large datasets, applying scientific methods to extract useful knowledge and insights from both structured and unstructured data.",
    "Data Science is a multidisciplinary field that uses scientific methods, algorithms, and systems to extract knowledge and insights from both structured and unstructured data, similar to Data Mining.",
    "Machine Learning, a subset of Artificial Intelligence, uses algorithms and statistical models to learn from data and improve predictions or decisions without explicit programming, closely related to Data Mining and Data Science.",
    "Artificial Intelligence encompasses systems that simulate human intelligence, including machine learning algorithms that learn from data to improve performance, making it closely related to Data Mining, Data Science, and Machine Learning."
]


### Use CountVectorizer to transform text into feature vectors

In [None]:
# Use CountVectorizer to transform text into feature vectors
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)

# Display the feature vectors (rows = documents, columns = vocabulary terms)
X.toarray()
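
Raw counts weight every term equally. As a hedged sketch of the TF-IDF alternative mentioned earlier, the example below weights terms with `TfidfVectorizer` and compares documents by cosine similarity. `toy_docs` is a made-up corpus, chosen so that the first two documents share vocabulary and the third shares none with the first.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical corpus for illustration only
toy_docs = [
    "data mining extracts knowledge from data",
    "data science extracts knowledge from data",
    "machines simulate human intelligence",
]

# Fit TF-IDF weights; rows are L2-normalized by default
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(toy_docs)

# Pairwise cosine similarity between all documents (3 x 3 matrix)
similarity = cosine_similarity(X_tfidf)
print(similarity.round(2))
```

The diagonal is 1 because every document is identical to itself, and documents with no terms in common have a similarity of 0.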

## Feature Engineering

Feature engineering is a crucial step in text classification. The raw text data must be transformed into numerical features that machine learning models can process. We will explore methods like tokenization, removing stop words, and normalization of terms.
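
Some of these steps can also be requested directly from the vectorizer. The sketch below uses the `lowercase`, `stop_words`, and `ngram_range` parameters of `CountVectorizer`; the `toy_docs` corpus is an assumption made for the example, and the built-in English stop list is not the same list as NLTK's.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus for illustration only
toy_docs = [
    "The model learns from the data",
    "The data improves the model",
]

# Lowercase the text, drop scikit-learn's built-in English stop words,
# and extract both single words and two-word sequences (bigrams)
engineered = CountVectorizer(lowercase=True, stop_words="english", ngram_range=(1, 2))
X_engineered = engineered.fit_transform(toy_docs)

# vocabulary_ maps each extracted term to its column index
engineered_terms = sorted(engineered.vocabulary_)
print(engineered_terms)
```

Bigrams such as `model learns` keep a little word-order information that single-word counts discard, at the cost of a larger vocabulary.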

In [None]:
# Tokenization and Stop Words Removal
# (the NLTK resources were already downloaded earlier in the notebook)
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # Tokenization
    tokens = word_tokenize(text)
    # Remove punctuation and stop words
    tokens = [word.lower() for word in tokens if word.isalpha() and word.lower() not in stop_words]
    return tokens

In [None]:
# Preprocess the documents
preprocessed_documents = [preprocess_text(doc) for doc in documents]
preprocessed_documents
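
One tokenization issue worth checking is what the `isalpha()` filter throws away. The sketch below applies the same filter to a hand-written token list, which is assumed to resemble tokenizer output (for example, "doesn't" split into `does` and `n't`); the `keep_alpha` helper is hypothetical and exists only for this illustration.

```python
from typing import List

def keep_alpha(tokens: List[str]) -> List[str]:
    # Same rule as the isalpha() filter above, without stop-word removal
    return [t.lower() for t in tokens if t.isalpha()]

# Hand-written tokens, assumed to resemble tokenizer output
sample_tokens = ["State-of-the-art", "NLP", "does", "n't", "handle",
                 "COVID-19", "e-mail", "well", "."]

kept = keep_alpha(sample_tokens)
dropped = [t for t in sample_tokens if not t.isalpha()]

print(kept)     # ['nlp', 'does', 'handle', 'well']
print(dropped)  # ['State-of-the-art', "n't", 'COVID-19', 'e-mail', '.']
```

Hyphenated terms, contractions, and tokens containing digits are removed along with punctuation, and losing `n't` can flip the meaning of a sentence. Whether that matters depends on the task.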

## Lemmatization and Stemming

Lemmatization and stemming both reduce words to a base form. Stemming applies heuristic rules that strip affixes (the Porter stemmer strips suffixes), so the result is not always a real word, while lemmatization maps a word to its dictionary form (lemma).
We will demonstrate both techniques using `PorterStemmer` and `WordNetLemmatizer`.

In [None]:
# Lemmatization and Stemming Example
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

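def lemmatize_and_stem(tokens):
    # Return (lemmas, stems) for one tokenized document
    # lemmatize() treats each word as a noun unless a pos argument is passed
    lemmatized = [lemmatizer.lemmatize(word) for word in tokens]
    stemmed = [stemmer.stem(word) for word in tokens]
    return lemmatized, stemmed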

# Apply lemmatization and stemming
lemmatized_documents, stemmed_documents = zip(*[lemmatize_and_stem(doc) for doc in preprocessed_documents])
lemmatized_documents, stemmed_documents
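
The difference between the two techniques is easier to see on a few individual words. The sketch below is illustrative: the word list is an assumption, and the lemmatizer calls are guarded because they need the WordNet data to be downloaded. It also passes `pos="v"` to show that the lemmatizer's result depends on the part of speech it is told to assume.

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

demo_stemmer = PorterStemmer()
demo_lemmatizer = WordNetLemmatizer()

# Hypothetical word list for illustration only
words = ["studies", "running", "patterns"]

# Stemming is rule-based and needs no extra data; stems may not be real words
stems = [demo_stemmer.stem(w) for w in words]
print(stems)  # ['studi', 'run', 'pattern']

try:
    # Default: every word is treated as a noun
    noun_lemmas = [demo_lemmatizer.lemmatize(w) for w in words]
    # Treating the same words as verbs can give different lemmas
    verb_lemmas = [demo_lemmatizer.lemmatize(w, pos="v") for w in words]
    print(noun_lemmas)
    print(verb_lemmas)
except LookupError:
    # Raised when the WordNet corpus has not been downloaded yet
    print("WordNet data not found; run nltk.download('wordnet') first")
```

If verbs matter for the task, one option is to tag tokens with their part of speech first and pass the matching `pos` value to `lemmatize()`.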