<a href="https://colab.research.google.com/github/vibhuverma17/AUTOML/blob/main/AUTONLP_Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Common Steps in an NLP Pipeline

A typical **NLP pipeline** consists of several key steps that transform raw text data into structured output for analysis or model training. Below are the common steps involved:

### 1. Text Collection and Data Preprocessing
- **Data Collection**: Gathering raw text data from various sources like web scraping, APIs, databases, or uploaded datasets.
- **Cleaning**: Removing markup and other irrelevant content (e.g., HTML tags, boilerplate, special characters).
- **Lowercasing**: Converting all text to lowercase so that "Movie" and "movie" are treated as the same word.
- **Removing Noise**: Stripping numbers, punctuation, or other characters that carry no signal for the task.
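
The cleaning steps above can be sketched with the standard library alone. This is a minimal illustration, not the preprocessing that Auto_NLP performs internally; the function name `clean_text` and the exact regular expressions are assumptions made for this example.

```python
import html
import re

def clean_text(raw: str) -> str:
    """Strip HTML tags, lowercase, and drop digits and punctuation (illustrative only)."""
    text = html.unescape(raw)                 # decode entities such as "&amp;"
    text = re.sub(r"<[^>]+>", " ", text)      # remove HTML tags
    text = text.lower()                       # lowercasing
    text = re.sub(r"[^a-z\s]", " ", text)     # keep only ASCII letters and whitespace
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace

print(clean_text("<p>I LOVED this movie!!! 10/10 &amp; more</p>"))
# i loved this movie more
```

Note that dropping everything outside `a-z` is only reasonable for English text; other languages need a different character filter.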

### 2. Tokenization
- **Word Tokenization**: Splitting the text into individual words or tokens. For example, the sentence "I love NLP" would be tokenized into `["I", "love", "NLP"]`.
- **Sentence Tokenization**: Splitting the text into sentences for sentence-level tasks.
- **Subword Tokenization**: Splitting words into smaller units (e.g., with Byte-Pair Encoding or WordPiece). This helps with rare or unseen words and with languages that form long compound words (e.g., German).
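
As a rough sketch, word and sentence tokenization can be approximated with regular expressions. The helper names `word_tokenize` and `sentence_tokenize` are hypothetical and defined here; libraries such as NLTK or spaCy handle abbreviations, quotes, and other edge cases that this naive version does not.

```python
import re
from typing import List

def word_tokenize(text: str) -> List[str]:
    """Match runs of word characters, keeping internal apostrophes ("Don't")."""
    return re.findall(r"\w+(?:'\w+)*", text)

def sentence_tokenize(text: str) -> List[str]:
    """Naive split on whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(word_tokenize("I love NLP"))
# ['I', 'love', 'NLP']
print(sentence_tokenize("I love NLP. Don't you? It is fun!"))
# ['I love NLP.', "Don't you?", 'It is fun!']
```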

### 3. Stop Words Removal
- Stop words are common words like "and", "the", "is" that do not contribute much meaning to the text analysis.
- These words are usually removed unless they serve a specific purpose in the context.
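
A small sketch of stop word removal follows. The `STOP_WORDS` set is a tiny hypothetical list chosen for this example; real lists (for instance those shipped with NLTK or scikit-learn) are much longer. The word "not" is deliberately left out of the list, since negation usually matters for sentiment analysis.

```python
# Tiny hypothetical stop word list, for illustration only
STOP_WORDS = {"a", "an", "and", "is", "the", "of", "to"}

def remove_stop_words(tokens):
    """Drop tokens whose lowercase form is in STOP_WORDS."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = ["The", "plot", "is", "slow", "and", "the", "acting", "is", "not", "good"]
print(remove_stop_words(tokens))
# ['plot', 'slow', 'acting', 'not', 'good']
```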

### 4. Text Normalization
- **Stemming**: Heuristically stripping suffixes to reduce words to a stem, which is not always a dictionary word (e.g., "running" -> "run", "studies" -> "studi").
- **Lemmatization**: Converting words to their base or dictionary form (e.g., "better" -> "good"). Lemmatization is more sophisticated than stemming as it considers the context and part of speech.
- **Spell Correction**: Correcting common spelling errors in the text.
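
To make the difference between stemming and lemmatization concrete, here is a toy sketch. `naive_stem` is a simplified suffix stripper (not the Porter algorithm), and `LEMMA_TABLE` is a hypothetical three-entry lookup standing in for a real dictionary such as WordNet.

```python
def naive_stem(word: str) -> str:
    """Toy suffix stripper: removes one common suffix, then undoes consonant doubling."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            # "runn" -> "run"; doubled l/s are kept so that "fall" and "pass" survive
            if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiouls":
                word = word[:-1]
            break
    return word

# Hypothetical lookup table; a real lemmatizer uses a dictionary plus part of speech
LEMMA_TABLE = {"better": "good", "was": "be", "mice": "mouse"}

def naive_lemma(word: str) -> str:
    """Return the dictionary form if known, otherwise the lowercased word."""
    return LEMMA_TABLE.get(word.lower(), word.lower())

print(naive_stem("running"), naive_stem("movies"), naive_stem("better"))
# run movi better
print(naive_lemma("better"))
# good
```

The stem "movi" is not a word, while the lemma of "better" requires dictionary knowledge that suffix stripping cannot provide.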

### 5. Part-of-Speech Tagging (POS Tagging)
- Assigning a part-of-speech label (e.g., noun, verb, adjective) to each word in a sentence, which helps in understanding the grammatical structure and meaning.

### 6. Named Entity Recognition (NER)
- Identifying and classifying entities in the text (e.g., names of people, places, dates, organizations).
- Example: In the sentence "Apple is releasing a new product in New York on March 10th," "Apple" (organization), "New York" (location), and "March 10th" (date) are recognized.

### 7. Vectorization/Feature Extraction
- **Bag of Words (BoW)**: Representing each document as a vector of word counts or frequencies, ignoring word order.
- **TF-IDF (Term Frequency-Inverse Document Frequency)**: A statistical measure to evaluate how important a word is to a document in a collection.
- **Word Embeddings**: Converting words to dense vectors (e.g., Word2Vec, GloVe, FastText), capturing semantic meaning and relationships between words.
- **Transformer-based Embeddings**: Using models like BERT, GPT, or T5 to get contextualized embeddings, where the vector for a word depends on the sentence it appears in.
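
The first two representations can be computed by hand on a toy corpus. The sketch below uses the plain textbook definition, tf × ln(N / df); the corpus and the helper name `tf_idf` are made up for this example, and libraries such as scikit-learn apply smoothing and normalization by default, so their numbers will differ.

```python
import math
from collections import Counter

docs = [
    "great movie great cast",
    "boring movie",
    "great fun",
]
tokenized = [d.split() for d in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})

# Bag of Words: one count vector per document, columns ordered by `vocab`
bow = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

def tf_idf(term, doc):
    """Unsmoothed TF-IDF; assumes `term` occurs in at least one document."""
    tf = doc.count(term) / len(doc)            # term frequency within this document
    df = sum(term in d for d in tokenized)     # number of documents containing the term
    return tf * math.log(len(tokenized) / df)  # idf = ln(N / df)

print(vocab)
# ['boring', 'cast', 'fun', 'great', 'movie']
print(bow[0])
# [0, 1, 0, 2, 1]
print(tf_idf("cast", tokenized[0]) > tf_idf("great", tokenized[0]))
# True
```

Although "great" appears twice in the first document, "cast" gets the higher weight because it occurs in only one of the three documents.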

### 8. Text Representation (Optional)
- Depending on the task, you may also want to represent the entire document or sentence using embeddings like Doc2Vec, sentence embeddings, or transformer-based models like BERT for more advanced NLP tasks.

### 9. Modeling
- **Supervised Learning**: Training a model for tasks like classification, regression, or sequence labeling (e.g., sentiment analysis, text classification, named entity recognition).
- **Unsupervised Learning**: Techniques like clustering (e.g., K-means) or topic modeling (e.g., Latent Dirichlet Allocation - LDA).
- **Deep Learning**: Using advanced architectures like LSTM, GRU, BERT, or GPT for more complex tasks.

### 10. Evaluation
- Evaluating model performance using metrics like accuracy, precision, recall, F1-score for classification tasks, or BLEU score for text generation tasks.
- **Cross-validation** may be used to assess the model's robustness and generalization.
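
For a binary task such as sentiment analysis, the classification metrics above reduce to a few counts. The following sketch computes them by hand on made-up labels; the function name `binary_metrics` is an assumption for this example, and in practice `sklearn.metrics` provides equivalent functions.

```python
def binary_metrics(y_true, y_pred, positive="Positive"):
    """Accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up labels: 2 true positives, 1 false positive, 1 false negative
y_true = ["Positive", "Positive", "Negative", "Negative", "Positive"]
y_pred = ["Positive", "Negative", "Negative", "Positive", "Positive"]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Here accuracy is 3/5, while precision, recall, and F1 all come out to 2/3.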

### 11. Post-Processing
- **Text Generation or Transformation**: Generating text, summarization, translation, or other transformations based on the model's output.
- **Filtering/Thresholding**: For tasks like sentiment analysis, setting a threshold on the model's predicted probability or score to classify text as positive or negative.
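
Thresholding can be sketched in a few lines. The scores below are made-up probabilities of the positive class, and `label_from_score` is a hypothetical helper; raising the threshold trades recall for precision on the positive class.

```python
def label_from_score(score: float, threshold: float = 0.5) -> str:
    """Map a positive-class probability to a label."""
    return "Positive" if score >= threshold else "Negative"

scores = [0.91, 0.48, 0.50, 0.12]  # made-up model outputs
print([label_from_score(s) for s in scores])
# ['Positive', 'Negative', 'Positive', 'Negative']
print([label_from_score(s, threshold=0.8) for s in scores])
# ['Positive', 'Negative', 'Negative', 'Negative']
```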

### 12. Deployment and Monitoring
- Once the model is trained and evaluated, it is deployed into a production environment.
- **Real-time inference**: Text data is passed through the model for prediction.
- **Monitoring**: Tracking the model's performance over time to ensure it continues to work effectively, including retraining with new data as necessary.

## Summary

The NLP pipeline typically starts with **data collection and preprocessing**, followed by **tokenization**, **stop words removal**, and **text normalization**. Then, it proceeds to **feature extraction** and **modeling**, followed by **evaluation**. Finally, it may include **post-processing** and **deployment** for real-world applications.

**Note:** This AutoML workflow uses classical machine learning models, not deep learning.

In [None]:
!pip install autoviml

In [None]:
import tensorflow_datasets as tfds
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from autoviml import Auto_NLP
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# batch_size=-1 loads each split in full as a single batch of tensors
datasets, info = tfds.load('imdb_reviews', with_info=True, batch_size=-1)

In [None]:
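# Convert datasets to DataFrames (tfds returns tensors, so go through NumPy first)
train_df = pd.DataFrame(tfds.as_numpy(datasets['train']))
test_df = pd.DataFrame(tfds.as_numpy(datasets['test']))
# The text arrives as bytes; decode it rather than using astype(str),
# which would keep the b'...' wrapper inside every review
train_df['text'] = train_df['text'].str.decode('utf-8')
test_df['text'] = test_df['text'].str.decode('utf-8')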

# Map the label: 1 = Positive, 0 = Negative
train_df['label'] = train_df['label'].map({1: 'Positive', 0: 'Negative'})
test_df['label'] = test_df['label'].map({1: 'Positive', 0: 'Negative'})

# Let's inspect the first few rows of the training set
print(train_df.head())

# Step 2: Define the input and target columns (the K-Fold split itself happens in Step 3)
input_columns = ['text']
target_column = 'label'

# Step 3: Set up K-Fold Cross Validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)  # 5-fold cross-validation
best_model = None
best_accuracy = 0
fold = 1
validation_results = []  # To store validation results for each fold

# Loop through each fold
for train_index, val_index in kf.split(train_df):
    print(f"Training fold {fold}...")

    # Split into training and validation sets
    train_fold = train_df.iloc[train_index]
    val_fold = train_df.iloc[val_index]

    # Run Auto_NLP for this fold. Auto_NLP is a function rather than an
    # estimator with .fit(); per the autoviml README it returns
    # (train_nlp, test_nlp, nlp_pipeline, predictions). Verify this against
    # your installed version.
    _, _, model, _ = Auto_NLP(nlp_column='text',     # Text column
                              train=train_fold,      # Training dataset
                              test=val_fold,         # Validation dataset
                              target=target_column,  # Target column
                              verbose=2)

    # Make predictions on the validation data (assumes the returned
    # pipeline accepts the raw text column)
    val_preds = model.predict(val_fold['text'])

    # Calculate accuracy for this fold
    accuracy = accuracy_score(val_fold[target_column], val_preds)
    validation_results.append(accuracy)

    print(f"Validation Accuracy for fold {fold}: {accuracy}")

    # Check if this is the best model so far
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_model = model

    fold += 1

# After K-Fold Cross-Validation, evaluate the best model on the test set
test_preds = best_model.predict(test_df['text'])

In [None]:
# Step 4: Calculate confusion matrix on the test set
conf_matrix = confusion_matrix(test_df[target_column], test_preds)

# Step 5: Display confusion matrix
print("Confusion Matrix on the Test Set:")
print(conf_matrix)

# Plot confusion matrix
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", xticklabels=['Negative', 'Positive'], yticklabels=['Negative', 'Positive'])
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()

# Optionally, calculate other performance metrics (precision, recall, F1-score)
print(classification_report(test_df[target_column], test_preds))