### **Text Classification Using Naive Bayes and Sentiment Analysis on Blog Posts**



**Objective**:

This project develops a text classification model based on the Naive Bayes algorithm to classify blog posts from the “blogs_categories.csv” dataset, and then performs sentiment analysis to identify the mood of each post (positive, negative, or neutral). The task is intended to build my proficiency in text classification, sentiment analysis, and NLP techniques, and to produce a detailed, well-documented report.

**Dataset**:

The "blogs_categories.csv" dataset includes blog posts with associated categories. Key columns are:

**Data**: Contains the blog post text.

**Labels**: Indicates the category (e.g., alt.atheism, rec.sport.hockey, sci.med).



**Details**: The dataset contains 2,000 records (20 categories with 100 posts each, as the exploration in Section 1.1 shows), which provide a varied basis for multi-class classification and sentiment analysis.

**1. Data Exploration and Preprocessing**



**1.1 Load the "blogs_categories.csv" Dataset and Perform an Exploratory Data Analysis to Understand Its Structure and Content**:

As a first step, I loaded the dataset and examined its shape, column types, missing values, and label distribution.

In [1]:
import pandas as pd

# Load the dataset
data = pd.read_csv('blogs.csv')

# Basic exploration
print("Dataset Shape:", data.shape)
print("First 5 Rows:\n", data.head())
print("Missing Values:\n", data.isnull().sum())
print("Data Types:\n", data.dtypes)
print("Unique Labels:", data['Labels'].nunique())
print("Label Distribution:\n", data['Labels'].value_counts())

Dataset Shape: (2000, 2)
First 5 Rows:
                                                 Data       Labels
0  Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...  alt.atheism
1  Newsgroups: alt.atheism\nPath: cantaloupe.srv....  alt.atheism
2  Path: cantaloupe.srv.cs.cmu.edu!das-news.harva...  alt.atheism
3  Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...  alt.atheism
4  Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:53...  alt.atheism
Missing Values:
 Data      0
Labels    0
dtype: int64
Data Types:
 Data      object
Labels    object
dtype: object
Unique Labels: 20
Label Distribution:
 Labels
alt.atheism                 100
comp.graphics               100
comp.os.ms-windows.misc     100
comp.sys.ibm.pc.hardware    100
comp.sys.mac.hardware       100
comp.windows.x              100
misc.forsale                100
rec.autos                   100
rec.motorcycles             100
rec.sport.baseball          100
rec.sport.hockey            100
sci.crypt                   100
sci.electronics    

The dataset comprises 2,000 rows and 2 columns (text and labels) and has no missing values. There are 20 classes with 100 examples each, so the class distribution is perfectly balanced. Many texts begin with message headers (e.g., "Path", "Xref"), which will require more careful cleaning.

**1.2 Preprocess the Data by Cleaning the Text (Removing Punctuation, Converting to Lowercase, etc.), Tokenizing, and Removing Stopwords:**

In [3]:
import pandas as pd
import re

# Loading the dataset from the uploaded file
data = pd.read_csv("blogs.csv")

# Displaying initial information to understand structure
print("Dataset loaded successfully.")
print("Shape of dataset:", data.shape)
print("\nColumns available:\n", data.columns)

# Checking first few entries
print("\nSample data:\n", data.head())

# Defining a custom list of common English stopwords
stop_words = set([
    "a", "an", "the", "and", "or", "but", "if", "while", "with", "to",
    "of", "at", "by", "for", "on", "in", "out", "up", "down", "from",
    "into", "over", "under", "again", "further", "then", "once", "here",
    "there", "all", "any", "both", "each", "few", "more", "most", "other",
    "some", "such", "no", "nor", "not", "only", "own", "same", "so",
    "than", "too", "very", "can", "will", "just", "don", "should", "now"
])

# Function for text preprocessing
def preprocess_text(text):
    # Handling missing or non-text values
    if not isinstance(text, str):
        return ''
    # Converting text to lowercase for consistency
    text = text.lower()
    # Removing all characters except alphabets and spaces
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Splitting text into tokens
    tokens = text.split()
    # Removing stopwords
    tokens = [word for word in tokens if word not in stop_words]
    # Rejoining tokens into a cleaned string
    return ' '.join(tokens)

# Identifying the text column for cleaning
# You can adjust this if your text is under a different column name
text_column = 'Data' if 'Data' in data.columns else data.columns[0]

# Applying preprocessing on the selected text column
# Removing punctuation, numbers, and stopwords
data['cleaned_text'] = data[text_column].apply(preprocess_text)

# Displaying sample cleaned text entries
print("\nFirst 5 cleaned text entries:\n")
print(data[['cleaned_text']].head())




Dataset loaded successfully.
Shape of dataset: (2000, 2)

Columns available:
 Index(['Data', 'Labels'], dtype='object')

Sample data:
                                                 Data       Labels
0  Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...  alt.atheism
1  Newsgroups: alt.atheism\nPath: cantaloupe.srv....  alt.atheism
2  Path: cantaloupe.srv.cs.cmu.edu!das-news.harva...  alt.atheism
3  Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...  alt.atheism
4  Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:53...  alt.atheism

First 5 cleaned text entries:

                                        cleaned_text
0  path cantaloupesrvcscmuedumagnesiumclubcccmued...
1  newsgroups altatheism path cantaloupesrvcscmue...
2  path cantaloupesrvcscmuedudasnewsharvardedunoc...
3  path cantaloupesrvcscmuedumagnesiumclubcccmued...
4  xref cantaloupesrvcscmuedu altatheism talkreli...


In the preprocessing stage, the text was converted to lowercase for consistency, and all punctuation marks, special symbols, and digits were removed. The text was then split into individual words, and common English stopwords from a custom list were filtered out.

The first few cleaned samples (shown in the output) reveal the limits of this approach. Header fields such as “Path,” “Newsgroups,” and “Xref” are still present, only lowercased. Because punctuation is deleted rather than replaced with a space, host names and addresses collapse into single long tokens such as “cantaloupesrvcscmuedu”, which carry little meaning for classification.

The cleaning step therefore normalizes case and removes punctuation, digits, and stopwords, but header metadata remains and may still influence feature extraction and model training.
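The sample output above shows header fields and fused host names surviving the cleaning step. The following is a minimal sketch of one possible remedy, not the preprocessing used in this report. It assumes that each post follows the usual news-message layout in which a blank line separates the headers from the body, which has not been verified against this dataset. The helper names `strip_headers` and `clean_body` are hypothetical.

```python
import re

def strip_headers(raw):
    """Drop everything before the first blank line (assumed to be the header block)."""
    head, sep, body = raw.partition("\n\n")
    # If there is no blank line, keep the text unchanged rather than lose it.
    return body if sep else raw

def clean_body(raw):
    """Lowercase, then replace non-letters with a space so tokens stay separate."""
    body = strip_headers(raw).lower()
    body = re.sub(r"[^a-z\s]", " ", body)
    # Collapse runs of whitespace left behind by the substitution.
    return " ".join(body.split())

sample = (
    "Path: cantaloupe.srv.cs.cmu.edu!news\n"
    "Newsgroups: alt.atheism\n"
    "\n"
    "I don't agree, see RFC-1036."
)
print(clean_body(sample))  # i don t agree see rfc
```

Replacing punctuation with a space, instead of deleting it, keeps "cs.cmu.edu" from turning into one long token. Stopword removal could be applied afterwards exactly as in `preprocess_text`.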

**1.3 Perform Feature Extraction to Convert Text Data into a Format that Can Be Used by the Naive Bayes Model, Using Techniques Such as TF-IDF**:  

TF-IDF (Term Frequency-Inverse Document Frequency) was applied to the cleaned text to convert it into numerical features that the Naive Bayes model can process.

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialized TF-IDF vectorizer
tfidf = TfidfVectorizer(max_features=5000, stop_words='english')

# Fitting and transforming the cleaned text
X_tfidf = tfidf.fit_transform(data['cleaned_text'])

# Converting labels to a list
y = data['Labels'].values

print("TF-IDF Matrix Shape:", X_tfidf.shape)
print("Sample of TF-IDF Features (first 5 terms):", list(tfidf.get_feature_names_out())[:5])


TF-IDF Matrix Shape: (2000, 5000)
Sample of TF-IDF Features (first 5 terms): ['aa', 'aafreenetcarletonca', 'aaron', 'ab', 'abate']


The TF-IDF matrix has shape (2000, 5000): 2,000 documents by 5,000 features, the cap set by max_features. The feature names are listed in alphabetical order, so the first five ('aa', 'aafreenetcarletonca', 'aaron', 'ab', 'abate') are simply the alphabetically earliest terms, not the most informative ones. A term such as 'aafreenetcarletonca' looks like a host name fused into one token during cleaning, which suggests that some header noise survived preprocessing. This matrix is the input to the Naive Bayes classifier.
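To make the weighting concrete, here is a dependency-free sketch of the TF-IDF variant that scikit-learn's `TfidfVectorizer` uses with its default settings: raw term counts, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The tokenization is simplified to a whitespace split, so this is an illustration under that assumption and not a replacement for the vectorizer. The function name `tfidf` and the toy documents are made up for the example.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) using smoothed IDF and L2-normalised rows."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenised for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        weights = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

docs = ["goal hockey goal", "hockey stick", "graphics card"]
vocab, rows = tfidf(docs)
print(vocab)  # terms in alphabetical order
print([round(w, 3) for w in rows[0]])
```

In the first toy document, "goal" outweighs "hockey" both because it occurs twice and because it appears in fewer documents, which is the behaviour TF-IDF is meant to capture.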

**2. Naive Bayes Model for Text Classification**



**2.1 Split the Data into Training and Test Sets**:

To evaluate how well the model works, the data set has been divided into the training and test sets.

In [5]:
from sklearn.model_selection import train_test_split

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)

print("Training Set Shape:", X_train.shape)
print("Test Set Shape:", X_test.shape)

Training Set Shape: (1600, 5000)
Test Set Shape: (400, 5000)


The data was split 80/20: 1,600 samples for training and 400 for testing. The split is random rather than stratified, so the per-class counts in the test set may deviate slightly from 20. Setting random_state=42 makes the split reproducible.
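For comparison, a stratified split keeps the class proportions identical in both parts. The sketch below shows the idea with the standard library only; `stratified_split` is a hypothetical helper and the three-class label list is toy data. In scikit-learn, a similar effect is available by passing `stratify=y` to `train_test_split`.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with the same class share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

labels = ["rec.autos"] * 100 + ["sci.med"] * 100 + ["sci.space"] * 100
train_idx, test_idx = stratified_split(labels)
print(Counter(labels[i] for i in test_idx))  # 20 test posts per class
```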

**2.2 Implement a Naive Bayes Classifier to Categorize the Blog Posts into Their Respective Categories**  

I used scikit-learn's Multinomial Naive Bayes classifier, a common choice for text classification that also works in practice with TF-IDF features.

In [6]:
from sklearn.naive_bayes import MultinomialNB

# Initialize the Naive Bayes classifier
nb_classifier = MultinomialNB()

# Training the model
nb_classifier.fit(X_train, y_train)

print("Model Training Completed")

Model Training Completed


The model was fitted on the TF-IDF features of the training set and the corresponding labels. Naive Bayes assumes that features are conditionally independent given the class. The assumption rarely holds for natural language, but it often works well for text classification.
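The independence assumption is easiest to see in a small hand-written version of the classifier. The sketch below works on raw token counts with Laplace smoothing (`alpha=1.0`); it is an illustration of the technique, not scikit-learn's implementation, and the function names and toy documents are invented for the example. Note that the notebook fits the model on TF-IDF weights rather than raw counts.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods from token counts."""
    vocab = {tok for doc in docs for tok in doc.split()}
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc.split())
    model = {}
    for label, n_docs in class_docs.items():
        total = sum(token_counts[label].values()) + alpha * len(vocab)
        log_lik = {t: math.log((token_counts[label][t] + alpha) / total) for t in vocab}
        model[label] = (math.log(n_docs / len(docs)), log_lik)
    return model

def predict_nb(model, doc):
    """Pick the class with the highest log prior plus summed token log likelihoods."""
    scores = {}
    for label, (log_prior, log_lik) in model.items():
        # Independence assumption: each known token adds its own term to the score.
        scores[label] = log_prior + sum(log_lik[t] for t in doc.split() if t in log_lik)
    return max(scores, key=scores.get)

docs = ["puck goal ice", "ice skate goal", "gpu driver card", "card monitor driver"]
labels = ["rec.sport.hockey", "rec.sport.hockey", "comp.graphics", "comp.graphics"]
model = train_nb(docs, labels)
print(predict_nb(model, "goal on ice"))  # rec.sport.hockey
```

Tokens that never appeared in training (here "on") are skipped, and the remaining tokens are scored one by one without regard to their order.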

**2.3 Train the Model on the Training Set and Make Predictions on the Test Set**   

The trained model was then used to predict categories for the test set; these predictions are evaluated in Section 4.

In [7]:
# Making predictions
y_pred = nb_classifier.predict(X_test)

# Displaying first 5 predictions
print("First 5 Predictions:", y_pred[:5])

First 5 Predictions: ['talk.politics.misc' 'comp.sys.ibm.pc.hardware' 'sci.med'
 'rec.sport.baseball' 'sci.electronics']


The first five predictions span a range of labels, including "talk.politics.misc" and "sci.electronics". Their correctness is checked against the true labels in Section 4.

**3. Sentiment Analysis**

**3.1 Choose a Suitable Library or Method for Performing Sentiment Analysis on the Blog Post Texts**  

For sentiment analysis, I chose the TextBlob library, a simple tool that assigns each text a polarity score between -1 and 1. A negative score indicates negative sentiment and a positive score indicates positive sentiment. The function below labels a post as neutral only when its score is exactly zero.

In [8]:
from textblob import TextBlob

# Function to get sentiment polarity
def get_sentiment(text):
    analysis = TextBlob(text)
    polarity = analysis.sentiment.polarity
    if polarity > 0:
        return 'positive'
    elif polarity < 0:
        return 'negative'
    else:
        return 'neutral'

# Apply sentiment analysis to the original 'Data' column
data['sentiment'] = data['Data'].apply(get_sentiment)

print("First 5 Sentiments:\n", data['sentiment'].head())

First 5 Sentiments:
 0    positive
1    negative
2    positive
3    positive
4    positive
Name: sentiment, dtype: object


TextBlob was applied to the original 'Data' column to label each post. Of the first five posts, four are labeled positive and one negative.

**3.2 Analyze the Sentiments Expressed in the Blog Posts and Categorize Them as Positive, Negative, or Neutral**

All blog posts were analyzed for sentiment and categorized according to that.

In [9]:
# Count sentiment distribution
sentiment_distribution = data['sentiment'].value_counts()

print("Sentiment Distribution:\n", sentiment_distribution)

Sentiment Distribution:
 sentiment
positive    1543
negative     457
Name: count, dtype: int64




The breakdown shows that 77.15% of the posts (1,543) are labeled positive and 22.85% (457) negative, with no neutral posts. The labels skew positive, perhaps because of the nature of the content or because TextBlob responds more readily to positive words. The absence of neutral posts follows largely from the labeling rule: a post is neutral only if its polarity is exactly 0, which is unlikely for long texts.
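One way to obtain a meaningful neutral class is to treat small polarity magnitudes as neutral. The sketch below illustrates the idea; the function name `label_polarity` and the threshold of 0.05 are assumptions chosen for the example, not values taken from TextBlob or tuned on this dataset.

```python
def label_polarity(polarity, threshold=0.05):
    """Map a polarity score in [-1, 1] to a label, treating small magnitudes as neutral."""
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"

for score in (0.30, 0.02, -0.01, -0.40):
    print(score, label_polarity(score))
```

With this rule, a post scoring 0.02 or -0.01 would be labeled neutral instead of positive or negative. The sentiment distribution would change accordingly and would need to be recomputed.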

**3.3 Examine the Distribution of Sentiments Across Different Categories and Summarize Your Findings**  

The sentiment distribution was then broken down across the 20 categories using a grouped count.

In [10]:
# Group by category and sentiment
sentiment_by_category = data.groupby(['Labels', 'sentiment']).size().unstack(fill_value=0)

print("Sentiment Distribution by Category:\n", sentiment_by_category)

Sentiment Distribution by Category:
 sentiment                 negative  positive
Labels                                      
alt.atheism                     23        77
comp.graphics                   24        76
comp.os.ms-windows.misc         22        78
comp.sys.ibm.pc.hardware        20        80
comp.sys.mac.hardware           24        76
comp.windows.x                  27        73
misc.forsale                    16        84
rec.autos                       17        83
rec.motorcycles                 26        74
rec.sport.baseball              29        71
rec.sport.hockey                34        66
sci.crypt                       19        81
sci.electronics                 19        81
sci.med                         29        71
sci.space                       27        73
soc.religion.christian          13        87
talk.politics.guns              30        70
talk.politics.mideast           22        78
talk.politics.misc              22        78
talk.religion.misc

The analysis shows that the sentiment mix varies by category. Because each category has 100 posts, the counts can be read directly as percentages. 'soc.religion.christian' (87% positive) and 'talk.religion.misc' (86% positive) have the highest positive shares, which may reflect uplifting content in these groups. 'misc.forsale' (84%) and 'rec.autos' (83%) are also strongly positive, possibly because of the promotional or enthusiastic tone of such posts.

The sports categories (rec.sport.hockey at 66% positive, rec.sport.baseball at 71%) and the medical and science categories (sci.med at 71%, sci.space at 73%) are still mostly positive but carry a substantial negative share (27-34%); rec.sport.hockey has the highest negative share overall (34%). The political categories are mixed: talk.politics.guns is 70% positive and has the highest negative share among them (30%), while talk.politics.mideast and talk.politics.misc are both 78% positive. The technical categories (comp.sys.ibm.pc.hardware at 80%, sci.crypt at 81%) are predominantly positive.

Overall, the religious and commercial categories lean most positive, while sports and some political topics show the largest negative shares.
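The comparison across categories rests on the positive share within each category. A small sketch of that calculation, using four rows copied from the table above (the helper `positive_share` is hypothetical):

```python
# (negative, positive) counts copied from four rows of the table above.
counts = {
    "soc.religion.christian": (13, 87),
    "misc.forsale": (16, 84),
    "talk.politics.guns": (30, 70),
    "rec.sport.hockey": (34, 66),
}

def positive_share(neg, pos):
    """Fraction of posts labelled positive within one category."""
    return pos / (neg + pos)

# Rank categories from most to least positive.
ranked = sorted(counts, key=lambda c: positive_share(*counts[c]), reverse=True)
for category in ranked:
    print(f"{category:<24} {positive_share(*counts[category]):.0%}")
```

On the notebook's `sentiment_by_category` frame, the same row-wise normalization could be written as `sentiment_by_category.div(sentiment_by_category.sum(axis=1), axis=0)`.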

**4. Evaluation**

**4.1 Evaluate the Performance of Your Naive Bayes Classifier Using Metrics Such as Accuracy, Precision, Recall, and F1-Score**  

The model performance was evaluated using various classification metrics available in scikit-learn.  

In [11]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-Score: {f1:.2f}")

Accuracy: 0.85
Precision: 0.85
Recall: 0.85
F1-Score: 0.84


The model reached 85% accuracy, with weighted precision, recall, and F1-score of 85%, 85%, and 84%, respectively. Because the classes are roughly balanced, the weighted averages are close to simple per-class averages, but they do not by themselves show that every category performs equally well. Weighted recall equals accuracy by construction. The slightly lower F1-score arises because F1 is computed per class before averaging, so classes whose precision and recall differ pull the average down; these are most likely the categories that get confused with one another.
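A short worked example shows how support-weighted averaging behaves and why the weighted F1-score can fall below both weighted precision and weighted recall. This is a dependency-free sketch on toy labels, with hypothetical helper names; it only considers labels that occur in `y_true`, which is a simplification.

```python
from collections import Counter

def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} for every true label."""
    scores = {}
    support = Counter(y_true)
    for label in support:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        precision = tp / predicted if predicted else 0.0
        recall = tp / support[label]
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        scores[label] = (precision, recall, f1, support[label])
    return scores

def weighted_average(scores, position):
    """Support-weighted mean of one metric (0=precision, 1=recall, 2=F1)."""
    total = sum(s[3] for s in scores.values())
    return sum(s[position] * s[3] for s in scores.values()) / total

y_true = ["a", "a", "a", "a", "b", "b", "b", "b"]
y_pred = ["a", "a", "a", "b", "a", "a", "b", "b"]
scores = per_class_scores(y_true, y_pred)
for name, pos in (("precision", 0), ("recall", 1), ("f1", 2)):
    print(name, round(weighted_average(scores, pos), 3))
```

In this toy case, five of eight predictions are correct, so accuracy and weighted recall are both 0.625, while the weighted F1-score comes out lower than either weighted precision or weighted recall.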

**4.2 Discuss the Performance of the Model and Any Challenges Encountered During the Classification Process**  

The Naive Bayes classifier achieved 85% accuracy, which shows that TF-IDF features carry enough signal to separate most of the 20 categories. The even class distribution (100 posts per category) probably helped the classifier's stability. The main difficulty was noisy text: headers and other metadata remain after preprocessing, partly as fused tokens, and may affect feature quality. Some headers, such as the Newsgroups line, contain the category name itself, so the reported accuracy may be optimistic. Two further limitations are the max_features=5000 cap, which may exclude informative terms, and the model's independence assumption, which ignores word order and context. Future work could include hyperparameter tuning or trying other models such as an SVM.

**4.3 Reflect on the Sentiment Analysis Results and Their Implications Regarding the Content of the Blog Posts**:

TextBlob labels 77.15% of the posts positive (1,543) and 22.85% negative (457), with none neutral. The high positive shares of 'soc.religion.christian' (87%) and 'talk.religion.misc' (86%) may reflect uplifting content, while 'rec.sport.hockey' (66%) and 'sci.med' (71%) remain mostly positive with a larger negative share, possibly due to competition or criticism. 'talk.politics.guns' (70% positive, 30% negative) is consistent with debate between opposing views. The absence of neutral posts is more likely an artifact of the labeling rule than a property of the content: a post is neutral only when its polarity is exactly zero, which long posts rarely are. Overall, religious, commercial, and technical categories lean most positive, while sports and some political topics carry the most negative sentiment. Since every category is majority positive and the scores were computed on text that still includes headers, these differences should be read as tendencies rather than firm conclusions about user sentiment.