# Assignment 19 : Text Classification using Naive Bayes and Sentiment Analysis on Blog Posts

## Objective :

Build a text classification model using the Naive Bayes algorithm to categorize the blog posts accurately, and perform sentiment analysis to understand the general sentiment (positive, negative, neutral) expressed in the posts.

## Dataset Description:

**Text**: The content of the blog post. Column name: Data

**Category**: The category to which the blog post belongs. Column name: Labels

## Task 1 : Data Exploration and Preprocessing

In [1]:
# Import necessary libraries.
import pandas as pd
import numpy as np
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

In [2]:
# load the dataset.
data = pd.read_csv("blogs.csv")

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Data    2000 non-null   object
 1   Labels  2000 non-null   object
dtypes: object(2)
memory usage: 31.4+ KB


In [4]:
data.head()

                                                Data       Labels
0  Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...  alt.atheism
1  Newsgroups: alt.atheism\nPath: cantaloupe.srv....  alt.atheism
2  Path: cantaloupe.srv.cs.cmu.edu!das-news.harva...  alt.atheism
3  Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...  alt.atheism
4  Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:53...  alt.atheism


In [5]:
data.Labels.unique()

array(['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc',
       'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware',
       'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles',
       'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt',
       'sci.electronics', 'sci.med', 'sci.space',
       'soc.religion.christian', 'talk.politics.guns',
       'talk.politics.mideast', 'talk.politics.misc',
       'talk.religion.misc'], dtype=object)

In [6]:
data.duplicated().sum()

0

In [7]:
data.isna().sum()

Data      0
Labels    0
dtype: int64

In [9]:
# Removing stop words.

import nltk
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\venky\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [20]:
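# Preprocessing function
import re
from nltk.corpus import stopwords

# Build the stop-word set once instead of on every call.
stop_words = set(stopwords.words('english'))

def clean_text(text):
    # Lowercase so that stop-word matching is case-insensitive.
    text = text.lower()
    # Remove punctuation, then digits.
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\d+', '', text)
    # Drop English stop words.
    text = ' '.join(word for word in text.split() if word not in stop_words)
    return text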

In [21]:
data['Data'] = data["Data"].apply(clean_text)

In [22]:
data["Data"]

0       path cantaloupesrvcscmuedumagnesiumclubcccmued...
1       newsgroups altatheism path cantaloupesrvcscmue...
2       path cantaloupesrvcscmuedudasnewsharvardedunoc...
3       path cantaloupesrvcscmuedumagnesiumclubcccmued...
4       xref cantaloupesrvcscmuedu altatheism talkreli...
                              ...                        
1995    xref cantaloupesrvcscmuedu talkabortion altath...
1996    xref cantaloupesrvcscmuedu talkreligionmisc ta...
1997    xref cantaloupesrvcscmuedu talkorigins talkrel...
1998    xref cantaloupesrvcscmuedu talkreligionmisc al...
1999    xref cantaloupesrvcscmuedu sciskeptic talkpoli...
Name: Data, Length: 2000, dtype: object
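The output above shows a side effect of deleting punctuation outright: dotted host and newsgroup names are fused into single long tokens such as cantaloupesrvcscmuedu. A minimal sketch of one alternative is to replace punctuation with a space so the parts stay separate tokens. The helper name clean_text_spaced and the tiny stop-word set are hypothetical stand-ins (the notebook uses NLTK's English list); this is an illustration under those assumptions, not the method used in the notebook.

```python
import re

# Hypothetical stand-in for NLTK's English stop-word list, kept tiny so the
# sketch runs without downloading any corpus.
DEMO_STOP_WORDS = {"the", "a", "an", "to", "of", "in", "is", "and"}

def clean_text_spaced(text, stop_words=DEMO_STOP_WORDS):
    """Like clean_text, but punctuation becomes a space instead of vanishing."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)   # punctuation -> space
    text = re.sub(r"\d+", " ", text)       # digits -> space
    return " ".join(w for w in text.split() if w not in stop_words)

sample = "Path: cantaloupe.srv.cs.cmu.edu alt.atheism:53"
cleaned = clean_text_spaced(sample)
print(cleaned)
```

Whether split tokens help or hurt the classifier depends on the data, so it is worth comparing both variants on the held-out set.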

In [23]:
# Vectorizing the data with TFIDF Vectorizer.
from sklearn.feature_extraction.text import TfidfVectorizer

In [24]:
TfIdf = TfidfVectorizer()
X = TfIdf.fit_transform(data["Data"])

In [25]:
Y=data["Labels"]

In [28]:
print(X)

  (0, 30846)	0.006374717673295059
  (0, 7267)	0.047828168304720566
  (0, 17172)	0.06862417373692555
  (0, 17169)	0.08717460331886823
  (0, 29678)	0.1372483474738511
  (0, 28591)	0.006374717673295059
  (0, 1403)	0.023284564343304878
  (0, 39952)	0.006374717673295059
  (0, 21744)	0.03406206589516532
  (0, 17837)	0.03677057630082175
  (0, 12163)	0.006374717673295059
  (0, 2150)	0.00694069564264917
  (0, 18142)	0.007324549863069558
  (0, 30105)	0.006621553676417196
  (0, 38045)	0.038990933140500325
  (0, 960)	0.03756845600025581
  (0, 24250)	0.00639386098399751
  (0, 13588)	0.016059083096566156
  (0, 45329)	0.017863933043927235
  (0, 26159)	0.006374717673295059
  (0, 36423)	0.050412893895282175
  (0, 34732)	0.009343519235184063
  (0, 35911)	0.10082578779056435
  (0, 36315)	0.10082578779056435
  (0, 34904)	0.10082578779056435
  :	:
  (1999, 35667)	0.16792227610159438
  (1999, 28739)	0.05820590101844723
  (1999, 24419)	0.05820590101844723
  (1999, 35767)	0.05820590101844723
  (1999, 12816)	0
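
Each printed entry above has the form (row, column) weight: the row is the document index, the column is the index of a term in the fitted vocabulary, and the value is that term's TF-IDF weight in that document. The sketch below uses a made-up three-document corpus (not taken from blogs.csv) to show how column indices can be mapped back to terms through the vectorizer's vocabulary_ attribute.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus for illustration only.
docs = [
    "hockey game tonight",
    "hockey team wins game",
    "graphics card driver",
]

vectorizer = TfidfVectorizer()
X_demo = vectorizer.fit_transform(docs)   # sparse matrix, one row per document
print(X_demo.shape)

# vocabulary_ maps term -> column index; invert it to look terms up by column.
index_to_term = {idx: term for term, idx in vectorizer.vocabulary_.items()}

# List the non-zero (term, weight) pairs of the first document.
first_row = X_demo.toarray()[0]
first_doc_terms = {}
for col, weight in enumerate(first_row):
    if weight > 0:
        first_doc_terms[index_to_term[col]] = weight
        print(index_to_term[col], round(float(weight), 3))
```

With the default settings each row is scaled to unit Euclidean length, which is why the weights of a long post are individually small.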

In [29]:
Y

0              alt.atheism
1              alt.atheism
2              alt.atheism
3              alt.atheism
4              alt.atheism
               ...        
1995    talk.religion.misc
1996    talk.religion.misc
1997    talk.religion.misc
1998    talk.religion.misc
1999    talk.religion.misc
Name: Labels, Length: 2000, dtype: object

## Task 2 : Naive Bayes Model for Text Classification

In [30]:
# Split the data into train and test.
from sklearn.model_selection import train_test_split

train_x,test_x,train_y,test_y = train_test_split(X,Y,test_size=0.2,random_state=7)

In [33]:
print("train_x",train_x.shape)
print("train_y",train_y.shape)
print("test_x",test_x.shape)
print("test_y",test_y.shape)

train_x (1600, 46297)
train_y (1600,)
test_x (400, 46297)
test_y (400,)


In [35]:
# Building the model using the Naive Bayes algorithm

from sklearn.naive_bayes import MultinomialNB

In [36]:
model = MultinomialNB()

In [37]:
model.fit(train_x,train_y)
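
In the cells above, the vectorizer is fitted on all 2,000 posts before the split, so the vocabulary and IDF weights also see the test posts. One common alternative, sketched here on made-up toy data, is to split the raw text first and wrap the vectorizer and classifier in a Pipeline, so that fitting only touches the training part. The stratify argument keeps class proportions the same in both parts. The texts and labels below are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy data: 2 classes, 5 posts each.
texts = [
    "hockey game score", "hockey team goal", "ice hockey playoff",
    "hockey season win", "goalie hockey save",
    "graphics card driver", "image render graphics", "graphics pixel shader",
    "3d graphics software", "graphics file format",
]
labels = ["hockey"] * 5 + ["graphics"] * 5

# Split the raw text first; stratify keeps the class ratio in both parts.
train_txt, test_txt, train_lbl, test_lbl = train_test_split(
    texts, labels, test_size=0.2, random_state=7, stratify=labels
)

# The vectorizer is fitted inside the pipeline, on training text only.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])
pipe.fit(train_txt, train_lbl)
pred = pipe.predict(test_txt)
print(list(zip(test_txt, pred)))
```

On the real data the same pattern would take data["Data"] and data["Labels"] in place of the toy lists.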

In [38]:
# Evaluation: predict labels for the held-out test set
from sklearn.metrics import classification_report

y_pred = model.predict(test_x)

In [39]:
y_pred

array(['rec.sport.hockey', 'rec.sport.hockey', 'comp.os.ms-windows.misc',
       'misc.forsale', 'talk.politics.mideast', 'comp.os.ms-windows.misc',
       'talk.politics.guns', 'talk.religion.misc', 'talk.politics.misc',
       'talk.politics.mideast', 'rec.motorcycles', 'talk.politics.misc',
       'talk.politics.mideast', 'talk.politics.mideast',
       'rec.motorcycles', 'sci.med', 'comp.sys.mac.hardware',
       'rec.sport.hockey', 'comp.windows.x', 'talk.religion.misc',
       'talk.politics.guns', 'rec.sport.hockey',
       'comp.sys.ibm.pc.hardware', 'rec.sport.hockey', 'alt.atheism',
       'rec.sport.hockey', 'soc.religion.christian', 'comp.graphics',
       'rec.sport.hockey', 'comp.graphics', 'rec.sport.baseball',
       'comp.windows.x', 'comp.sys.ibm.pc.hardware',
       'soc.religion.christian', 'sci.electronics', 'sci.crypt',
       'sci.med', 'comp.graphics', 'talk.politics.misc',
       'comp.os.ms-windows.misc', 'rec.autos', 'rec.autos',
       'talk.religion.misc', 

In [41]:
print(classification_report(test_y,y_pred))

                          precision    recall  f1-score   support

             alt.atheism       0.79      0.71      0.75        21
           comp.graphics       0.80      0.84      0.82        19
 comp.os.ms-windows.misc       0.95      0.91      0.93        22
comp.sys.ibm.pc.hardware       0.81      0.65      0.72        26
   comp.sys.mac.hardware       0.86      0.86      0.86        22
          comp.windows.x       0.95      0.95      0.95        22
            misc.forsale       0.79      1.00      0.88        15
               rec.autos       1.00      0.84      0.91        19
         rec.motorcycles       0.95      0.95      0.95        19
      rec.sport.baseball       0.81      1.00      0.90        13
        rec.sport.hockey       0.96      1.00      0.98        22
               sci.crypt       0.94      1.00      0.97        17
         sci.electronics       0.84      0.80      0.82        20
                 sci.med       0.95      0.87      0.91        23
         

## Evaluation

In [None]:
import sklearn.metrics as metrics

In [45]:
print(metrics.accuracy_score(test_y,y_pred)*100)

84.0


In [49]:
print(metrics.precision_score(test_y,y_pred, average="weighted")*100)

86.33390325905343


In [50]:
print(metrics.recall_score(test_y,y_pred,average="weighted")*100)

84.0


In [51]:
print(metrics.f1_score(test_y,y_pred,average="weighted")*100)

83.78183170697372
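
The weighted recall above (84.0) matches the accuracy exactly, and that is expected: weighting each class's recall by its support and summing gives the number of correct predictions divided by the number of samples. A small check on made-up labels:

```python
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels for illustration only.
y_true = ["a", "a", "a", "b", "b", "c"]
y_hat = ["a", "a", "b", "b", "c", "c"]

acc = accuracy_score(y_true, y_hat)
weighted_recall = recall_score(y_true, y_hat, average="weighted")

# Manual version: per-class recall weighted by class support.
manual = 0.0
for c in sorted(set(y_true)):
    support = sum(1 for t in y_true if t == c)
    hits = sum(1 for t, p in zip(y_true, y_hat) if t == c and p == c)
    manual += (support / len(y_true)) * (hits / support)

print(acc, weighted_recall, manual)
```

Because of this identity, macro-averaged recall (average="macro") is often the more informative companion to accuracy when class sizes differ.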


## Task 3 : Sentiment Analysis


In [52]:
!pip install textblob

Collecting textblob
  Downloading textblob-0.18.0.post0-py3-none-any.whl.metadata (4.5 kB)
Downloading textblob-0.18.0.post0-py3-none-any.whl (626 kB)
Installing collected packages: textblob
Successfully installed textblob-0.18.0.post0


In [53]:
from textblob import TextBlob

In [55]:
def Sentiment(text):
    # TextBlob polarity ranges from -1 (most negative) to +1 (most positive).
    polarity = TextBlob(text).sentiment.polarity
    if polarity > 0:
        return "positive"
    elif polarity == 0:
        return "Neutral"
    else:
        return "Negative"
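
The function above returns Neutral only when the polarity is exactly 0, which happens for just 3 of the 2,000 posts in this notebook. A hedged sketch of one alternative is to treat scores inside a small band around zero as neutral. The helper label_polarity and the threshold tol=0.05 are hypothetical choices for illustration, not part of TextBlob or of the notebook.

```python
def label_polarity(polarity, tol=0.05):
    """Map a polarity score in [-1, 1] to a label, with a neutral band.

    tol is a hypothetical tolerance chosen for illustration; tune it on data.
    """
    if polarity > tol:
        return "Positive"
    if polarity < -tol:
        return "Negative"
    return "Neutral"

# Scores near zero now fall into the neutral band.
for score in (0.4, 0.02, 0.0, -0.03, -0.6):
    print(score, label_polarity(score))
```

On the real data this could be applied to the polarity that TextBlob returns for each post, in place of the exact comparison with 0.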
    

In [56]:
data["Sentiment"] = data["Data"].apply(Sentiment)

In [57]:
data["Sentiment"]

0       positive
1       Negative
2       positive
3       positive
4       positive
          ...   
1995    positive
1996    positive
1997    positive
1998    positive
1999    positive
Name: Sentiment, Length: 2000, dtype: object

In [58]:
data.Sentiment.unique()

array(['positive', 'Negative', 'Neutral'], dtype=object)

In [59]:
data.Sentiment.value_counts()

positive    1453
Negative     544
Neutral        3
Name: Sentiment, dtype: int64
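
The counts above cover the whole dataset. To see how sentiment varies by category, one option is a cross-tabulation of Labels against Sentiment. The sketch below uses a small made-up frame with the same column names; on the real data the same call would take data["Labels"] and data["Sentiment"].

```python
import pandas as pd

# Made-up rows standing in for the notebook's `data` frame.
demo = pd.DataFrame({
    "Labels": ["sci.med", "sci.med", "rec.autos", "rec.autos", "rec.autos"],
    "Sentiment": ["positive", "Negative", "positive", "positive", "Neutral"],
})

# Rows: category, columns: sentiment, values: share of posts within the category.
dist = pd.crosstab(demo["Labels"], demo["Sentiment"], normalize="index")
print(dist.round(2))
```

Normalizing by row makes categories of different sizes comparable; dropping normalize="index" gives raw counts instead.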