
# News Categorization using Word Embeddings and spaCy



**Objective:**
This project demonstrates text classification with pre-trained Word2Vec embeddings and the spaCy library. Each article is tokenized and lemmatized with spaCy, stop words and punctuation are removed, and the remaining tokens are averaged into a single 300-dimensional Word2Vec vector. The vectors are then split into training and test sets and used to train a classifier that assigns a label (here, Fake or Real) to each input text.

**Use Cases:**
1. **News Authenticity Assessment:** On the dataset used here, the classifier separates articles labeled Fake from articles labeled Real with about 98% accuracy on the held-out test set. A model of this kind can support media literacy and the screening of unreliable content.
2. **Sentiment Analysis:** The same pipeline can be trained on sentiment-labeled data to predict the sentiment expressed in a text, which lets businesses gauge public opinion about products or services.

In [10]:
# import pandas library
import pandas as pd

# importing dataset
news_df = pd.read_csv("https://raw.githubusercontent.com/raviteja-padala/Datasets/main/fake_and_real_news.csv")

In [11]:
news_df.head()

Unnamed: 0,Text,label
0,Top Trump Surrogate BRUTALLY Stabs Him In The...,Fake
1,U.S. conservative leader optimistic of common ...,Real
2,"Trump proposes U.S. tax overhaul, stirs concer...",Real
3,Court Forces Ohio To Allow Millions Of Illega...,Fake
4,Democrats say Trump agrees to work on immigrat...,Real


In [12]:
df = news_df.copy()

In [13]:
#check the distribution of labels
df['label'].value_counts()

Fake    5000
Real    4900
Name: label, dtype: int64

In [14]:
# Add a numeric column that encodes each label (Fake -> 0, Real -> 1)

df['label_num'] = df['label'].map({'Fake': 0, 'Real': 1})
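A quick check after a dictionary-based mapping like the one above is to confirm that no label was left unmapped. The sketch below uses a small toy frame (not the news dataset) with a deliberately mis-cased label; `Series.map` returns NaN for any value missing from the dictionary, so those rows can be listed directly.

```python
import pandas as pd

# Toy frame standing in for the news data; the 'real' row is a deliberate
# casing mismatch to show what an unmapped label looks like.
toy = pd.DataFrame({"label": ["Fake", "Real", "real"]})

label_to_num = {"Fake": 0, "Real": 1}
toy["label_num"] = toy["label"].map(label_to_num)

# Series.map yields NaN for any value that is not a key of the dictionary.
unmapped = toy.loc[toy["label_num"].isna(), "label"].tolist()
print(unmapped)
```

An empty list here would mean every label was encoded.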

In [15]:
df.head()

Unnamed: 0,Text,label,label_num
0,Top Trump Surrogate BRUTALLY Stabs Him In The...,Fake,0
1,U.S. conservative leader optimistic of common ...,Real,1
2,"Trump proposes U.S. tax overhaul, stirs concer...",Real,1
3,Court Forces Ohio To Allow Millions Of Illega...,Fake,0
4,Democrats say Trump agrees to work on immigrat...,Real,1


In [16]:
#Load Google News Word2vec model from gensim library
import gensim.downloader as api

# Load the pre-trained Word2Vec model from Google News dataset
# This model contains word vectors with 300 dimensions
wv = api.load('word2vec-google-news-300')



In [17]:
# en_core_web_lg is spaCy's large English pipeline. Unlike en_core_web_sm, it ships with static word vectors; in this notebook it is used for tokenization, lemmatization, and stop-word detection.
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.6.0/en_core_web_lg-3.6.0-py3-none-any.whl (587.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 587.7/587.7 MB 2.1 MB/s eta 0:00:00
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.6.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


In [18]:
import spacy

# Load the large English pipeline downloaded above
nlp = spacy.load("en_core_web_lg")

In [19]:
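# Preprocess a text with spaCy and average the Word2Vec vectors of its tokens
def preprocess_and_vectorize(text):
    doc = nlp(text)
    # Keep the lemmas of tokens that are neither stop words nor punctuation
    filtered_tokens = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]
    # Average the vectors of the remaining tokens; tokens missing from the vocabulary are skipped by default
    return wv.get_mean_vector(filtered_tokens)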

In [20]:
# Define a text for vectorization
text = "Text Classification using Word Embeddings"

# Call the preprocess_and_vectorize function to convert text into a vector
v = preprocess_and_vectorize(text)

# Check the shape of the vector
print("Shape of the vector:", v.shape)

Shape of the vector: (300,)
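One edge case is worth keeping in mind: a text made up only of stop words and punctuation leaves the token list empty, and gensim's `get_mean_vector` appears to raise a `ValueError` on empty input (worth confirming against the installed version). The sketch below shows one possible guard using plain NumPy and a hypothetical three-dimensional lookup table in place of the real 300-dimensional model; `safe_mean_vector` and `toy_vectors` are illustrative names, not part of the notebook.

```python
import numpy as np

# Hypothetical 3-dimensional embedding table standing in for the Word2Vec model.
toy_vectors = {
    "news": np.array([1.0, 0.0, 0.0], dtype=np.float32),
    "fake": np.array([0.0, 1.0, 0.0], dtype=np.float32),
}

def safe_mean_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(known, axis=0)

print(safe_mean_vector(["fake", "news", "unknownword"], toy_vectors, 3))
print(safe_mean_vector([], toy_vectors, 3))
```

Returning a zero vector keeps the output shape fixed, so downstream stacking and prediction still work; dropping such rows before training would be an equally reasonable choice.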


# Word2Vec mean vector

A mean vector aggregates the individual word vectors of a text into a single fixed-length representation by averaging them element-wise. The result has the same dimensionality as each word vector (300 here) no matter how many words the text contains, which is what allows sentences, phrases, and whole articles to be fed to a standard classifier. The trade-off is that averaging discards word order, so the mean vector captures the overall topic of a text rather than the specific details of each word.

In [22]:
# Import the necessary libraries
import numpy as np

# Get word vectors for "worry" and "understand"
v1 = wv["worry"]
v2 = wv["understand"]

# Calculate the mean vector using NumPy
numpy_mean = np.mean([v1, v2], axis=0)[:3]  # Taking the first 3 dimensions
print(f"NumPy Mean Vector (Starting 3 Dimensions): {numpy_mean}")

# Calculate the mean vector using Word2Vec's get_mean_vector method
wv_mean = wv.get_mean_vector([v1, v2])[:3]  # Taking the first 3 dimensions
print(f"Word2Vec Mean Vector (Starting 3 Dimensions): {wv_mean}")


NumPy Mean Vector (Starting 3 Dimensions): [ 0.00976562 -0.00561523 -0.08905029]
Word2Vec Mean Vector (Starting 3 Dimensions): [ 0.00976562 -0.00561523 -0.08905029]
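The two results above match because raw arrays were passed in. In gensim 4.x, `get_mean_vector` also has a `pre_normalize` parameter that, as far as can be told from the API, defaults to `True` and scales vectors looked up by key to unit length before averaging; this is an assumption to confirm for the installed version. The toy sketch below (hypothetical 2-dimensional values, plain NumPy) shows how that kind of normalization changes an average.

```python
import numpy as np

# Two toy vectors with very different lengths (hypothetical values).
long_vec = np.array([10.0, 0.0])
short_vec = np.array([0.0, 1.0])

# Plain mean: the longer vector dominates the result.
raw_mean = np.mean([long_vec, short_vec], axis=0)

# Mean of unit-length vectors: each word contributes equally to the direction.
unit_vecs = [v / np.linalg.norm(v) for v in (long_vec, short_vec)]
normalized_mean = np.mean(unit_vecs, axis=0)

print("raw mean:       ", raw_mean)
print("normalized mean:", normalized_mean)
```

Normalizing first prevents a few words with unusually large vectors from dominating the representation of an article.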


# Vectorising text column of dataframe

In [24]:
from tqdm import tqdm  # Progress bars for long-running pandas operations

In [None]:
# Register tqdm's progress_apply on pandas objects, then vectorize every article.
# The result is stored in a new 'vector' column (one 300-dimensional array per row).
tqdm.pandas()
df['vector'] = df['Text'].progress_apply(preprocess_and_vectorize)

In [27]:
df.head()

Unnamed: 0,Text,label,label_num,vector
0,Top Trump Surrogate BRUTALLY Stabs Him In The...,Fake,0,"[0.008657642, 0.019024342, -0.011917442, 0.032..."
1,U.S. conservative leader optimistic of common ...,Real,1,"[0.010864096, 0.007960429, 0.0011915653, 0.014..."
2,"Trump proposes U.S. tax overhaul, stirs concer...",Real,1,"[0.018134918, 0.0062743523, -0.005872244, 0.03..."
3,Court Forces Ohio To Allow Millions Of Illega...,Fake,0,"[0.01255197, 0.012613623, 5.9780963e-05, 0.021..."
4,Democrats say Trump agrees to work on immigrat...,Real,1,"[-0.0019059887, 0.011889367, 0.0035395357, 0.0..."


# Train test split

In [28]:
from sklearn.model_selection import train_test_split


# Split into train and test sets: 20% test, random_state=2022 for reproducibility, stratified by label
X_train, X_test, y_train, y_test = train_test_split(
    df.vector.values,
    df.label_num,
    test_size=0.2, # 20% samples will go to test dataset
    random_state=2022,
    stratify=df.label_num
)
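To see what `stratify` does, the sketch below repeats the split on synthetic labels with the same class counts as the dataset (5000 Fake, 4900 Real) and counts the classes on each side. The features are placeholders; only the label proportions matter here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the dataset's class counts: 5000 Fake (0), 4900 Real (1).
y = np.array([0] * 5000 + [1] * 4900)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=2022, stratify=y
)

# Count how many samples of each class landed in each split.
train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print("train:", train_counts, "test:", test_counts)
```

With stratification, both splits keep the original Fake-to-Real ratio, which is why the test set later contains 1000 Fake and 980 Real articles.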

In [29]:
# Print shapes before reshaping
print("Shape of X_train before reshaping: ", X_train.shape)
print("Shape of X_test before reshaping: ", X_test.shape)

Shape of X_train before reshaping:  (7920,)
Shape of X_test before reshaping:  (1980,)


Shape of X_train before reshaping: (7920,)
X_train is a 1D array of 7920 elements, one per sample. Each element is itself a 300-dimensional NumPy array, so NumPy treats the container as an array of objects rather than as a 7920 × 300 matrix.

Shape of X_test before reshaping: (1980,)
Likewise, X_test is a 1D array of 1980 elements, each holding one 300-dimensional vector.


# Need to reshape

- Most machine learning algorithms, including scikit-learn estimators, expect the input as a 2D array in which each row is one sample and each column is one feature. Stacking the per-article vectors into a single matrix puts the data in that format.


In [30]:
# Stack the per-sample vectors into 2D arrays
X_train_2d = np.stack(X_train)
X_test_2d = np.stack(X_test)

# Shapes after reshaping
print("Shape of X_train after reshaping: ", X_train_2d.shape)
print("Shape of X_test after reshaping: ", X_test_2d.shape)

Shape of X_train after reshaping:  (7920, 300)
Shape of X_test after reshaping:  (1980, 300)


- Shape of X_train after reshaping: (7920, 300). X_train is now a 2D array with 7920 rows (one per sample) and 300 columns (one per dimension of the vector representation).
- Shape of X_test after reshaping: (1980, 300). X_test is likewise a 2D array with 1980 rows and 300 columns.
- Stacking turns the 1D arrays of per-article vectors into 2D arrays in which each row is a news article and each column is one dimension of the Word2Vec embedding. This is the layout that machine learning models expect for training and prediction.
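The 1D-to-2D conversion described above can be reproduced on toy data. The sketch below builds a small object array whose elements are vectors (3 samples of 4 dimensions, chosen arbitrarily) and stacks it with `np.stack`.

```python
import numpy as np

# Build a 1D object array whose elements are themselves 1D vectors,
# mimicking the structure of df.vector.values (toy size: 3 samples, 4 dimensions).
vectors = np.empty(3, dtype=object)
for i in range(3):
    vectors[i] = np.full(4, float(i))

print(vectors.shape)          # 1D: one entry per sample

matrix = np.stack(vectors)    # stack the per-sample vectors row by row
print(matrix.shape)           # 2D: samples x dimensions
```

`np.stack` requires every element to have the same length, which holds here because all mean vectors share the embedding dimension.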

In [31]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report

# 1. Create a GradientBoosting model object
clf = GradientBoostingClassifier()

# 2. Fit on the training vectors (X_train_2d) and labels (y_train)
clf.fit(X_train_2d, y_train)


# 3. Predict labels for the test vectors (X_test_2d) and store them in y_pred
y_pred = clf.predict(X_test_2d)


# 4. Print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.99      0.97      0.98      1000
           1       0.97      0.99      0.98       980

    accuracy                           0.98      1980
   macro avg       0.98      0.98      0.98      1980
weighted avg       0.98      0.98      0.98      1980
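To make the columns of the report concrete, the sketch below derives precision, recall, F1, and accuracy from a confusion matrix. The counts are hypothetical: they were chosen only to be consistent with the rounded figures in the report and are not the notebook's actual confusion matrix.

```python
import numpy as np

# Hypothetical confusion matrix (rows = true class, columns = predicted class),
# chosen to agree with the rounded report; not the notebook's actual counts.
cm = np.array([[970, 30],    # true Fake: 970 predicted Fake, 30 predicted Real
               [10, 970]])   # true Real: 10 predicted Fake, 970 predicted Real

precision = np.diag(cm) / cm.sum(axis=0)   # correct / everything predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)      # correct / everything truly in that class
f1 = 2 * precision * recall / (precision + recall)
accuracy = np.trace(cm) / cm.sum()

for label in (0, 1):
    print(label, round(precision[label], 2), round(recall[label], 2), round(f1[label], 2))
print("accuracy", round(accuracy, 2))
```

Read this way, the lower recall for class 0 in the report means that some Fake articles were labeled Real, while fewer Real articles were labeled Fake.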



# Making predictions

In [32]:
# Make predictions using the trained classifier
test_news = [
    "Michigan governor denies misleading U.S. House on Flint water (Reuters) - Michigan Governor Rick Snyder denied Thursday that he had misled a U.S. House of Representatives committee last year over testimony on Flint’s water crisis after lawmakers asked if his testimony had been contradicted by a witness in a court hearing. The House Oversight and Government Reform Committee wrote Snyder earlier Thursday asking him about published reports that one of his aides, Harvey Hollins, testified in a court hearing last week in Michigan that he had notified Snyder of an outbreak of Legionnaires’ disease linked to the Flint water crisis in December 2015, rather than 2016 as Snyder had testified. “My testimony was truthful and I stand by it,” Snyder told the committee in a letter, adding that his office has provided tens of thousands of pages of records to the committee and would continue to cooperate fully.  Last week, prosecutors in Michigan said Dr. Eden Wells, the state’s chief medical executive who already faced lesser charges, would become the sixth current or former official to face involuntary manslaughter charges in connection with the crisis. The charges stem from more than 80 cases of Legionnaires’ disease and at least 12 deaths that were believed to be linked to the water in Flint after the city switched its source from Lake Huron to the Flint River in April 2014. Wells was among six current and former Michigan and Flint officials charged in June. The other five, including Michigan Health and Human Services Director Nick Lyon, were charged at the time with involuntary manslaughter",
    " WATCH: Fox News Host Loses Her Sh*t, Says Investigating Russia For Hacking Our Election Is Unpatriotic This woman is insane.In an incredibly disrespectful rant against President Obama and anyone else who supports investigating Russian interference in our election, Fox News host Jeanine Pirro said that anybody who is against Donald Trump is anti-American. Look, it s time to take sides,  she began.",
    " Sarah Palin Celebrates After White Man Who Pulled Gun On Black Protesters Goes Unpunished (VIDEO) Sarah Palin, one of the nigh-innumerable  deplorables  in Donald Trump s  basket,  almost outdid herself in terms of horribleness on Friday."
]

test_news_vectors = [preprocess_and_vectorize(n) for n in test_news]
predictions = clf.predict(test_news_vectors)

print("Predictions:", predictions)

Predictions: [1 0 0]


In [33]:
# Reverse mapping of label_num to label
label_num_to_label = {0: 'Fake', 1: 'Real'}

# Make predictions using the trained classifier
test_news = [
    "WATCH: Fox News Host Loses Her Sh*t, Says Investigating Russia For Hacking Our Election Is Unpatriotic This woman is insane.In an incredibly disrespectful rant against President Obama and anyone else who supports investigating Russian interference in our election, Fox News host Jeanine Pirro said that anybody who is against Donald Trump is anti-American. Look, it s time to take sides,  she began."
]

test_news_vectors = [preprocess_and_vectorize(n) for n in test_news]
predictions = clf.predict(test_news_vectors)

# Map numeric predictions to label names
predicted_labels = [label_num_to_label[prediction] for prediction in predictions]

# Print the predictions as 'Fake' or 'Real'
for news, prediction in zip(test_news, predicted_labels):
    print(f"News: {news}\nPrediction: {prediction}\n")


News: WATCH: Fox News Host Loses Her Sh*t, Says Investigating Russia For Hacking Our Election Is Unpatriotic This woman is insane.In an incredibly disrespectful rant against President Obama and anyone else who supports investigating Russian interference in our election, Fox News host Jeanine Pirro said that anybody who is against Donald Trump is anti-American. Look, it s time to take sides,  she began.
Prediction: Fake
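The vectorize, predict, and map-to-label steps repeat in each prediction cell, so they could be wrapped in one helper. The sketch below is one possible shape for it; `classify_texts`, `toy_vectorize`, and `ToyClassifier` are hypothetical names, and the toy vectorizer and classifier are stand-ins with an arbitrary rule so the example runs without the Word2Vec model or the trained `clf`.

```python
import numpy as np

label_num_to_label = {0: "Fake", 1: "Real"}

def classify_texts(texts, vectorize, classifier, id_to_label):
    """Vectorize each text, predict a class id, and return readable labels."""
    features = np.stack([vectorize(t) for t in texts])
    return [id_to_label[int(p)] for p in classifier.predict(features)]

# Hypothetical stand-ins so the sketch runs on its own.
def toy_vectorize(text):
    # Two crude features: share of upper-case characters, and text length.
    upper_share = sum(c.isupper() for c in text) / max(len(text), 1)
    return np.array([upper_share, float(len(text))])

class ToyClassifier:
    def predict(self, X):
        # Arbitrary illustrative rule: many capitals -> class 0, otherwise class 1.
        return np.where(X[:, 0] > 0.3, 0, 1)

print(classify_texts(["WATCH THIS NOW", "Senate passes budget bill"],
                     toy_vectorize, ToyClassifier(), label_num_to_label))
```

In the notebook itself, the equivalent call would presumably be `classify_texts(test_news, preprocess_and_vectorize, clf, label_num_to_label)`.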




In [34]:
# Reverse mapping of label_num to label
label_num_to_label = {0: 'Fake', 1: 'Real'}

# Make predictions using the trained classifier
test_news = [
    "Top 5 unusual tragic deaths on sets # Top5darkests 0 Over the years, conspiracies and theories of paranormal activity on movie sets has grown. With a large amount of horror productions having unfortunate deaths, some deaths closely resembling story lines of the horror production, theories of movies with a curse has been spoken by some. From deaths on movie productions involving the devil to the conspiracy of the hanging extra in the wizard of oz, we will cover in this video our top 5 unusual tragic deaths on sets."
]

test_news_vectors = [preprocess_and_vectorize(n) for n in test_news]
predictions = clf.predict(test_news_vectors)

# Map numeric predictions to label names
predicted_labels = [label_num_to_label[prediction] for prediction in predictions]

# Print the predictions as 'Fake' or 'Real'
for news, prediction in zip(test_news, predicted_labels):
    print(f"News: {news}\nPrediction: {prediction}\n")


News: Top 5 unusual tragic deaths on sets # Top5darkests 0 Over the years, conspiracies and theories of paranormal activity on movie sets has grown. With a large amount of horror productions having unfortunate deaths, some deaths closely resembling story lines of the horror production, theories of movies with a curse has been spoken by some. From deaths on movie productions involving the devil to the conspiracy of the hanging extra in the wizard of oz, we will cover in this video our top 5 unusual tragic deaths on sets.
Prediction: Fake



In [35]:
# Reverse mapping of label_num to label
label_num_to_label = {0: 'Fake', 1: 'Real'}

# Make predictions using the trained classifier
test_news = [
    "Hillary Clinton faces the last major contest of the primary campaign on Tuesday having already been declared the Democratic presidential nominee, making her the first woman in history to lead a major party bid for the White House. The declaration that Clinton had won the support of the 2,383 delegates needed to clinch the nomination came from the Associated Press late on Monday, before voting was due to commence in primaries in California and five other states. The legitimacy of AP’s declaration, which was announced 24 hours earlier than her campaign expected, was immediately called into question by Clinton’s rival, Bernie Sanders. The Vermont senator’s campaign issued a defiant statement that condemned the media’s “rush to judgment” and signalled that the Vermont senator was willing, if possible, to contest the nomination at the Democratic National Convention in July. However, as voters headed to the polls in California, New Jersey, Montana, North Dakota, South Dakota and New Mexico, it was clear that the mathematics were squarely on the side of the former secretary of state. The unexpected and somewhat anti-climactic twist in the race appeared to surprise the Clinton campaign, which has not altered its plan and is waiting until voting concludes on Tuesday before declaring her the Democratic nominee-in-waiting at a victory party in New York. Clinton made reference to the AP declaration during a campaign event in Long Beach, California, on Monday night. “I got to tell you, according to the news, we are on the brink of a historic, historic, unprecedented moment, but we still have work to do, don’t we?” she said. On Tuesday Clinton secured the endorsement of House Democratic leader Nancy Pelosi of California and, according to US media reports, aides to Barack Obama are in discussion with her campaign with a view to the president formally backing her soon. "
]

test_news_vectors = [preprocess_and_vectorize(n) for n in test_news]
predictions = clf.predict(test_news_vectors)

# Map numeric predictions to label names
predicted_labels = [label_num_to_label[prediction] for prediction in predictions]

# Print the predictions as 'Fake' or 'Real'
for news, prediction in zip(test_news, predicted_labels):
    print(f"News: {news}\nPrediction: {prediction}\n")


News: Hillary Clinton faces the last major contest of the primary campaign on Tuesday having already been declared the Democratic presidential nominee, making her the first woman in history to lead a major party bid for the White House. The declaration that Clinton had won the support of the 2,383 delegates needed to clinch the nomination came from the Associated Press late on Monday, before voting was due to commence in primaries in California and five other states. The legitimacy of AP’s declaration, which was announced 24 hours earlier than her campaign expected, was immediately called into question by Clinton’s rival, Bernie Sanders. The Vermont senator’s campaign issued a defiant statement that condemned the media’s “rush to judgment” and signalled that the Vermont senator was willing, if possible, to contest the nomination at the Democratic National Convention in July. However, as voters headed to the polls in California, New Jersey, Montana, North Dakota, South Dakota and New Me

## Conclusion:

This project combines spaCy preprocessing with pre-trained Word2Vec embeddings to categorize news articles. The pipeline tokenizes and lemmatizes each article, averages its word vectors into a single 300-dimensional representation, trains a gradient boosting classifier on those vectors, and evaluates it on a held-out test set of 1,980 articles, where it reaches 98% accuracy. The same steps can serve as a baseline for other text categorization tasks and as a starting point for more advanced natural language processing applications.