# Text Representation in NLP

# Natural Language Processing: Text Classification with Vectorization

## Objective
The aim of this notebook is to demonstrate the process of transforming textual data into numerical form using two different vectorization techniques—CountVectorizer and TfidfVectorizer—and to employ a Logistic Regression model for the classification of text. We will focus on classifying text data as 'toxic' or 'non-toxic'.

## Dataset
We utilize a dataset from the Jigsaw Toxic Comment Classification Challenge, which comprises various user comments labeled for toxicity. The task is to predict the likelihood of a comment being toxic.

## Methodology
We begin by preprocessing the text data with a custom tokenizer built on spaCy, an open-source natural language processing library. The tokenizer splits the text into tokens, lemmatizes them, and filters out stop words and punctuation.

Next, we explore two vectorization methods:
1. **CountVectorizer**: This method converts text documents to a matrix of token counts, effectively creating a bag-of-words model where the frequency of words is used as features.
2. **TfidfVectorizer**: This technique produces a matrix of TF-IDF (Term Frequency-Inverse Document Frequency) features, weighting a word up when it is frequent within a document and down when it is common across all documents.

With our text data vectorized, we then train a Logistic Regression model to classify comments into toxic or non-toxic categories. This model is chosen for its efficiency and effectiveness in binary classification tasks.

## Evaluation
We evaluate the performance of our model using metrics such as accuracy, precision, recall, and F1 score, aiming to understand the strengths and limitations of each vectorization method in the context of text classification.

By the end of this notebook, we will have a clear understanding of how to transform text data for machine learning purposes and how to implement a classification model on NLP tasks.




In [None]:
# !pip install scikit-learn -q
# !pip install spacy -q
# !python -m spacy download en_core_web_sm

In [59]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
import string
import spacy

import numpy as np
import pandas as pd 

In [60]:
np.random.seed(42)

Loading the dataset from a CSV file related to the Jigsaw Toxic Comment Classification Challenge from Kaggle.

Source: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge

In [None]:
data = pd.read_csv('../data/jigsaw-toxic-comment-classification-challenge/train.csv')

In [4]:
data.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


The DataFrame `data` contains the following columns:

- `id`: A unique identifier for each comment.
- `comment_text`: The text of the comment itself.
- `toxic`: A binary indicator (0 or 1) for whether the comment is toxic.
- `severe_toxic`: A binary indicator for whether the comment is severely toxic.
- `obscene`: A binary indicator for whether the comment contains obscene language.
- `threat`: A binary indicator for whether the comment contains a threat.
- `insult`: A binary indicator for whether the comment is insulting.
- `identity_hate`: A binary indicator for whether the comment contains hate speech against someone's identity.



## Using spaCy

Loading the spaCy language model

In [7]:
nlp = spacy.load("en_core_web_sm")


In [8]:
stop_words = nlp.Defaults.stop_words
print(stop_words)

{'if', 'behind', 'about', 'rather', '’s', 'after', 'whether', 'anything', '’m', 'as', 'five', 'anyway', 'per', 'my', 'most', 'still', 'whose', 'should', 'have', 'unless', 'they', 'has', '‘s', 'once', "'m", 'no', 'up', 'elsewhere', 'of', 'at', 'any', 'so', 'throughout', 'together', 'everywhere', 'ca', 'these', 'hundred', 'four', 'yours', 'whom', 'both', 'in', 'front', 'others', 'somehow', 'eleven', 'could', 'would', 'became', 'very', 'do', '‘m', 'last', 'nobody', 'used', 'formerly', 'meanwhile', 'why', 'whenever', 'amongst', 'down', 'often', 'besides', 'two', 'go', 'ten', 'from', 'whereas', 'moreover', "n't", 'whole', 'many', 'them', 'sometime', 'anywhere', 'else', 'such', 'make', 'becoming', 'thru', 'among', 'into', 'indeed', 'itself', 'the', 'latterly', 'otherwise', 'that', 'themselves', 'also', 'its', 'various', 'did', 'hereafter', 'some', 'whither', 'me', 'really', 'on', 'however', 'anyhow', 'all', 're', 'move', 'never', 'fifty', 'he', 'above', 'thereafter', 'out', 'which', 'was', '

In [9]:
punctuations = string.punctuation
print(punctuations)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


The custom tokenizer function below is intended for use with vectorizers such as `CountVectorizer` and `TfidfVectorizer`.

It does more preprocessing than the vectorizers' default tokenizer, which helps in three ways:

- **Lemmatization**: By using lemmas instead of the surface form of words, the model can more easily recognize that words like "run" and "running" are variations of the same concept, which can improve model performance.

- **Case Normalization**: Converting all text to lower case helps the model treat words like "The" and "the" as the same word.

- **Stop Words/Punctuation Removal**: Filtering out these reduces the number of features in the vectorized output and focuses the model's attention on the more meaningful content.

The tokenizer is a key part of the preprocessing pipeline for NLP tasks, and when used in conjunction with vectorizers, it prepares the text data for model training, ensuring that the input features are representative of the semantic content of the text.

In [10]:
def spacy_tokenizer(sentence):
    """Tokenize, lemmatize, and lowercase a sentence; drop stop words and punctuation."""
    # Uncomment the print statements to inspect the intermediate token states.
    # print(sentence)

    # Parse the input sentence using spaCy.
    doc = nlp(sentence)

    # print(doc)
    # print(type(doc))

    # Build a list of tokens by lemmatizing and lowercasing each word in the document.
    mytokens = [word.lemma_.lower() for word in doc]

    # print(mytokens)

    # Filter out tokens that are stop words or punctuation.
    mytokens = [token for token in mytokens if token not in stop_words and token not in punctuations]

    # print(mytokens)

    return mytokens
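
The filtering step can be tried without loading a spaCy model. The sketch below is an illustration only: the lemma list and the tiny stop-word set are hypothetical stand-ins for what `nlp(sentence)` and `nlp.Defaults.stop_words` would provide, and `filter_tokens` is not a function from this notebook. It also shows a subtlety of the check used above: because `punctuations` is a string, `token not in punctuations` is a substring test, whereas a set matches single punctuation characters exactly.

```python
import string

# Hypothetical stand-ins: a real run would get these from spaCy.
lemmas = ["I", "be", "eat", "apple", ",", "I", "like", "apple", "..."]
toy_stop_words = {"i", "be", "am"}

punct_string = string.punctuation    # `in` performs a substring test
punct_set = set(string.punctuation)  # `in` matches one character exactly


def filter_tokens(tokens, stop_words, punctuation):
    """Lowercase tokens, then drop stop words and punctuation."""
    lowered = [t.lower() for t in tokens]
    return [t for t in lowered if t not in stop_words and t not in punctuation]


kept = filter_tokens(lemmas, toy_stop_words, punct_set)
print(kept)

# "()" is a substring of string.punctuation; "..." is not.
print("()" in punct_string, "()" in punct_set)
print("..." in punct_string, "..." in punct_set)
```

Multi-character tokens such as `...` pass both variants of the check, so they can end up as features unless they are filtered separately.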

## CountVectorizer

`CountVectorizer` is a text analysis method that converts a collection of text documents into a matrix of token counts. This is part of the bag-of-words model, where the frequency (count) of each word is used as a feature for training a classifier.

**Key points:**
- It disregards grammar and word order but does capture multiplicity (i.e., how many times a word appears).
- Each term found by the analyzer during the fit is assigned a unique integer index corresponding to a column in the resulting matrix.
- This model can be very useful in cases where the occurrence of words is more important than the order in which they occur.
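
The idea behind a bag-of-words matrix can be sketched in plain Python. This is a minimal illustration, not scikit-learn's implementation: the two already-tokenized documents are hypothetical input, and `to_count_vector` is a helper written for this example.

```python
from collections import Counter

# Hypothetical documents, already tokenized and cleaned.
tokenized_docs = [
    ["eat", "apple", "like", "apple"],
    ["play", "cricket"],
]

# Vocabulary: each unique term gets a column index (alphabetical order).
unique_terms = sorted({term for doc in tokenized_docs for term in doc})
vocabulary = {term: idx for idx, term in enumerate(unique_terms)}


def to_count_vector(tokens, vocab):
    """Count how often each vocabulary term occurs in one document."""
    counts = Counter(tokens)
    return [counts.get(term, 0) for term in sorted(vocab, key=vocab.get)]


count_matrix = [to_count_vector(doc, vocabulary) for doc in tokenized_docs]
print(vocabulary)
print(count_matrix)
```

Word order is lost (the first document could be shuffled without changing its row), but the repeated 'apple' is still reflected as a count of 2.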

**CountVectorizer Initialization:**
   
We will initialize `CountVectorizer` with the custom tokenizer, `spacy_tokenizer`. Compared to the default tokenizer, which only extracts runs of two or more word characters, it adds lemmatization and removes stop words and punctuation.
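
For comparison, scikit-learn's documented default `token_pattern` is the regular expression `(?u)\b\w\w+\b`. The sketch below applies that pattern with Python's `re` module to a lowercased sentence. It only imitates the default tokenization step and is not the vectorizer itself.

```python
import re

# scikit-learn's documented default token_pattern for its text vectorizers.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

sentence = "I am eating apple, I like apple".lower()
default_tokens = re.findall(DEFAULT_TOKEN_PATTERN, sentence)
print(default_tokens)
```

The single-character "i" is dropped by the pattern, but "am" survives and "eating" is not reduced to "eat", which is the gap the custom tokenizer closes.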

In [11]:
cv = CountVectorizer(tokenizer=spacy_tokenizer)

In [12]:
sentences = ["I am eating apple, I like apple","I am playing cricket"]

In [None]:
#convert the list of sentences into a document-term matrix.
vectors = cv.fit_transform(sentences).toarray() 

In [14]:
vectors

array([[2, 0, 1, 1, 0],
       [0, 1, 0, 0, 1]], dtype=int64)

In [17]:
# Get the feature names to understand what each column in the vectors represents
cv.get_feature_names_out()

array(['apple', 'cricket', 'eat', 'like', 'play'], dtype=object)

The sentences have been vectorized into numerical arrays, and the vocabulary has been identified as `['apple', 'cricket', 'eat', 'like', 'play']`.

The vectorized representation corresponds to the sentences as follows:

1. For "I am eating apple, I like apple":
   - `[2, 0, 1, 1, 0]`
   - 'apple' appears twice, 'eat' once (it seems 'eating' has been reduced to its lemma 'eat'), 'like' once, and 'cricket' and 'play' do not appear.

2. For "I am playing cricket":
   - `[0, 1, 0, 0, 1]`
   - 'cricket' and 'play' appear once each (again 'playing' has been reduced to 'play'), and 'apple', 'eat', and 'like' do not appear.


Interpreting the vectorization results:

- The vocabulary does not include common English stop words such as "I" and "am", nor the punctuation mark ",". All of these have been removed by our custom tokenizer.
- The verbs 'eating' and 'playing' have been lemmatized to 'eat' and 'play' by the custom tokenizer.
- The counts in the vectors correspond to how many times each word in the vocabulary appears in each sentence.

This vectorization process has effectively converted the text into a format suitable for use with machine learning algorithms, which require numerical input. The Logistic Regression model can now use these vectors to learn patterns associated with different categories (in this case, potentially toxic versus non-toxic comments if applied to the Jigsaw dataset).

In [18]:
cv.vocabulary_

{'eat': 2, 'apple': 0, 'like': 3, 'play': 4, 'cricket': 1}

### Applying the CountVectorizer on actual data for text classification

In [62]:
X = data['comment_text']
y = data['toxic']

In [63]:
from sklearn.model_selection import train_test_split

In [69]:
# We want to use 5% of the data for model building
# Set the test_size to 0.95 so that 5% remains as the training set
# 5% of the data is randomly selected for model building while preserving the distribution of the 'toxic' column

X, X_unused, y, y_unused = train_test_split(
    X, y, test_size=0.95, stratify=y, random_state=42
)
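
What `stratify=y` achieves can be sketched in plain Python: the sample is drawn separately within each label group, so the share of each class is preserved. This is a simplified illustration, not scikit-learn's algorithm; `stratified_sample` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(labels, fraction, seed=42):
    """Return sorted indices of a sample that keeps each label's share."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    chosen = []
    for indices in by_label.values():
        k = round(len(indices) * fraction)  # per-class sample size
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


# Hypothetical labels: 10% positive, loosely mimicking an imbalanced target.
toy_labels = [1] * 100 + [0] * 900
sample_idx = stratified_sample(toy_labels, fraction=0.05)
sample_labels = [toy_labels[i] for i in sample_idx]

print(len(sample_idx), sum(sample_labels) / len(sample_labels))
```

The 5% sample contains 50 items, of which 10% are positive, the same share as in the full toy list.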

In [70]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, stratify=y)

In [71]:
X_train.shape

(6382,)

In [72]:
X_test.shape

(1596,)

In [73]:
from sklearn.linear_model import LogisticRegression

In [80]:
X_train_vectors = cv.fit_transform(X_train)
X_test_vectors = cv.transform(X_test)

In [81]:
X_train_vectors.shape

(6382, 24542)

In [82]:
X_train_vectors.toarray()

array([[1, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0],
       ...,
       [0, 1, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [83]:
model = LogisticRegression(max_iter=1000)
model.fit(X_train_vectors, y_train)

LogisticRegression(max_iter=1000)

In [84]:
y_pred = model.predict(X_test_vectors)

In [85]:
metrics.accuracy_score(y_test, y_pred)

0.9392230576441103

In [86]:
metrics.precision_score(y_test, y_pred)

0.8181818181818182

In [87]:
metrics.recall_score(y_test, y_pred)

0.47058823529411764

In [88]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.95      0.99      0.97      1443
           1       0.82      0.47      0.60       153

    accuracy                           0.94      1596
   macro avg       0.88      0.73      0.78      1596
weighted avg       0.93      0.94      0.93      1596



### CountVectorizer Results

- **Precision for Class 0 (Non-Toxic)**: 0.95 indicates that 95% of the comments predicted as non-toxic were indeed non-toxic.
- **Recall for Class 0**: 0.99 shows that 99% of the actual non-toxic comments were correctly identified by the model.
- **F1-Score for Class 0**: 0.97 is a high score, showing a good balance between precision and recall for non-toxic comments.

- **Precision for Class 1 (Toxic)**: 0.82 means that 82% of the comments predicted as toxic were actually toxic.
- **Recall for Class 1**: 0.47 indicates that the model could correctly identify only 47% of the actual toxic comments.
- **F1-Score for Class 1**: 0.60, which is considerably lower than for Class 0, suggests that the model is less effective in correctly identifying toxic comments.

- **Overall Accuracy**: 0.94 shows that the model correctly predicted 94% of the comments.
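
The figures in the report follow directly from four confusion-matrix counts. The counts used below are not printed anywhere in this notebook. They are the values implied by the reported support (1443 and 153), precision (0.8182), and recall (0.4706) for the toxic class, so treat them as a reconstruction.

```python
# Counts implied by the report above (a reconstruction, not notebook output).
tp, fp, fn, tn = 72, 16, 81, 1427

precision = tp / (tp + fp)          # share of predicted-toxic that is toxic
recall = tp / (tp + fn)             # share of actual-toxic that is found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} accuracy={accuracy:.2f}")
```

Under these counts, 81 of the 153 toxic comments are missed, which is what pulls recall below 0.5 even though overall accuracy stays at 0.94.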

## TfidfVectorizer

`TfidfVectorizer` converts a collection of raw documents to a matrix of TF-IDF features. TF-IDF stands for Term Frequency-Inverse Document Frequency, which reflects how important a word is to a document in a collection or corpus.

**Key points:**
- The TF (Term Frequency) of a word is the frequency of a word (i.e., number of times it appears) in a document.
- The IDF (Inverse Document Frequency) measures how rare a term is across the corpus. It increases when the word appears in few documents and decreases when the word appears in many.
- The TF-IDF value increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.
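
With scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`), the weights can be written as follows, where $n$ is the number of documents, $\mathrm{df}(t)$ is the number of documents containing term $t$, and $\mathrm{tf}(t, d)$ is the raw count of $t$ in document $d$. This summarizes the library's documented defaults; other TF-IDF variants use different smoothing or scaling.

```latex
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1, \qquad
w(t, d) = \mathrm{tf}(t, d)\,\mathrm{idf}(t), \qquad
\hat{w}(t, d) = \frac{w(t, d)}{\sqrt{\sum_{t'} w(t', d)^2}}
```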

**TfidfVectorizer Initialization:**

Similar to `CountVectorizer`, `TfidfVectorizer` is also initialized with the `spacy_tokenizer`. 

In [75]:
tfidf_vector = TfidfVectorizer(tokenizer=spacy_tokenizer)

In [76]:
vectors = tfidf_vector.fit_transform(sentences).toarray() 

In [77]:
vectors

array([[0.81649658, 0.        , 0.40824829, 0.40824829, 0.        ],
       [0.        , 0.70710678, 0.        , 0.        , 0.70710678]])

In [78]:
tfidf_vector.get_feature_names_out()

array(['apple', 'cricket', 'eat', 'like', 'play'], dtype=object)

The output shows the TF-IDF vectorized representation of the same sentences, with the vocabulary identified as `['apple', 'cricket', 'eat', 'like', 'play']`.

The TF-IDF vectors are:

1. For "I am eating apple, I like apple":
   - `[0.81649658, 0., 0.40824829, 0.40824829, 0.]`
   - 'apple' has the highest value since it is the most frequent term in this sentence but not the only term, so its TF-IDF score is high but not as high as if it were the only word in the document.
   - 'eat' and 'like' have the same TF-IDF value, which is lower than 'apple' because they appear fewer times.
   - 'cricket' and 'play' are not present in this sentence, hence their TF-IDF value is 0.

2. For "I am playing cricket":
   - `[0., 0.70710678, 0., 0., 0.70710678]`
   - 'cricket' and 'play' have equal TF-IDF values, as they both appear once and are the only terms in the sentence.
   - 'apple', 'eat', and 'like' do not appear in this sentence, so their TF-IDF values are 0.

The values in the TF-IDF vectors are not simple counts, but rather weighted frequencies that are designed to reflect the importance of each term within the sentence as well as across the set of sentences. The TF-IDF score increases with the number of times a word appears in a document but is offset by the number of documents in the corpus that the word appears in. This is why TF-IDF is a popular feature extraction method for text data, as it can highlight words that are distinct to a particular document.

Interpretation of the TF-IDF results:

- The TF (term frequency) of 'apple' in the first sentence is 2, twice that of 'eat' and 'like', which is why its weight is twice theirs (0.816 versus 0.408).
- The IDF (inverse document frequency) does not distinguish between terms in this toy corpus: every term occurs in exactly one of the two sentences, so all terms receive the same IDF. IDF would lower a term's weight only if that term also appeared in the other sentence.
- 'cricket' and 'play' each appear once and are the only terms left in the second sentence, so they share its weight equally (0.707 each).
- Each row is L2-normalized to unit length, which is why the values lie between 0 and 1 rather than being integers as in the `CountVectorizer` output. It is also why 'cricket' (0.707) scores higher than 'eat' (0.408) although both occur once: the first sentence spreads its unit length over more term occurrences.
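
The toy vectors can be reproduced by hand. The sketch below assumes scikit-learn's default settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization) and starts from the count matrix shown earlier. `tfidf_row` is a helper written for this illustration.

```python
import math

counts = [[2, 0, 1, 1, 0],   # "I am eating apple, I like apple"
          [0, 1, 0, 0, 1]]   # "I am playing cricket"
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many documents each term occurs.
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]


def tfidf_row(row):
    """Weight raw counts by IDF, then scale the row to unit L2 length."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted]


tfidf = [tfidf_row(row) for row in counts]
for row in tfidf:
    print([round(v, 8) for v in row])
```

The printed rows match the `TfidfVectorizer` output above to eight decimal places.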

The resulting vectors from `TfidfVectorizer` can now be used as input for machine learning models such as Logistic Regression, just like the ones from `CountVectorizer`. The difference is that the TF-IDF vectors provide a more nuanced representation that gives more weight to terms that are distinctive to each document.

In [89]:
tfidf_vector.vocabulary_

{'eat': 2, 'apple': 0, 'like': 3, 'play': 4, 'cricket': 1}

### Applying the TfidfVectorizer on actual data for text classification

In [90]:
X_train_vectors_tf = tfidf_vector.fit_transform(X_train)
X_test_vectors_tf = tfidf_vector.transform(X_test)

In [50]:
X_train_vectors_tf.shape

(1600, 10899)

In [91]:
model = LogisticRegression()

In [92]:
model.fit(X_train_vectors_tf, y_train)

LogisticRegression()

In [93]:
y_pred_tf = model.predict(X_test_vectors_tf)

In [94]:
metrics.accuracy_score(y_test, y_pred_tf)

0.9348370927318296

In [95]:
metrics.precision_score(y_test, y_pred_tf)

0.9803921568627451

In [96]:
metrics.recall_score(y_test, y_pred_tf)

0.32679738562091504

In [97]:
print(classification_report(y_test, y_pred_tf))

              precision    recall  f1-score   support

           0       0.93      1.00      0.97      1443
           1       0.98      0.33      0.49       153

    accuracy                           0.93      1596
   macro avg       0.96      0.66      0.73      1596
weighted avg       0.94      0.93      0.92      1596



### TfidfVectorizer Results

- **Precision for Class 0**: 0.93 means that 93% of the comments predicted as non-toxic were indeed non-toxic.
- **Recall for Class 0**: 1.00 (after rounding) indicates that virtually all actual non-toxic comments were correctly identified.
- **F1-Score for Class 0**: 0.97, the same as with `CountVectorizer`.

- **Precision for Class 1**: 0.98 is very high, suggesting that almost all comments predicted as toxic were indeed toxic.
- **Recall for Class 1**: 0.33 shows that the model missed many toxic comments, identifying only 33% of them correctly.
- **F1-Score for Class 1**: 0.49, lower than that achieved with `CountVectorizer`.

- **Overall Accuracy**: 0.93, slightly lower than the model trained with `CountVectorizer`.

### Comparison and Interpretation

- **Performance on Non-Toxic Comments (Class 0)**: Both models perform excellently in identifying non-toxic comments, with high precision, recall, and F1-scores.
  
- **Performance on Toxic Comments (Class 1)**: The model trained with `CountVectorizer` has a better balance between precision and recall for toxic comments, as indicated by the higher F1-score. The `TfidfVectorizer` model has higher precision but significantly lower recall, meaning it's more conservative in labeling comments as toxic.

- **Overall Accuracy**: Both models have similar overall accuracy, but the `CountVectorizer` model has a slightly higher accuracy.

- **Bias Towards Majority Class**: Both models show a tendency to perform better on the majority class (non-toxic), which is common in imbalanced datasets. The low recall for toxic comments (Class 1) in both models suggests that they struggle with identifying the minority class effectively.

### Conclusion

For this particular dataset, both models are good at predicting non-toxic comments, but there is a notable difference in how they handle toxic comments. The `CountVectorizer` model provides a more balanced performance for the toxic class, making it potentially more suitable for scenarios where identifying toxic comments is crucial. The high precision but low recall of the `TfidfVectorizer` model, however, might be preferable in situations where false positives (non-toxic comments labeled as toxic) are more problematic than false negatives (toxic comments not identified). Depending on the specific needs and the tolerance for false positives and false negatives, one might be chosen over the other.
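
The trade-off between false positives and false negatives can be made concrete by moving the decision threshold applied to predicted probabilities. The scores and labels below are invented for illustration and do not come from the Jigsaw models; with a fitted scikit-learn classifier, such scores would typically come from `predict_proba`.

```python
# Hypothetical toxic-class probabilities and true labels (1 = toxic).
scores = [0.95, 0.80, 0.62, 0.45, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]


def precision_recall(threshold):
    """Precision and recall for the toxic class at a given threshold."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    pairs = list(zip(predicted, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)
    return tp / (tp + fp), tp / (tp + fn)


for threshold in (0.7, 0.5, 0.35):
    p, r = precision_recall(threshold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")
```

On this toy data, a high threshold gives perfect precision but finds only half of the toxic items, while a lower threshold finds all of them at the cost of one false positive.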

In [58]:
confusion_matrix(y_test, y_pred)

array([[358,   2],
       [ 25,  15]], dtype=int64)