<a href="https://colab.research.google.com/github/ramyakv7/RAmya-Kv-assignment-3-PCA/blob/main/NLP_Assignment_BBC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1>
<font color=Blue>
News Articles Classification using NLP
</font>
</h1>

**Objective**

News articles are one of the richest sources of data for many businesses. ABC company wants to build a website and recommend content to the users of its web application. Whenever a new article or piece of content arrives, the company wants to classify it into one of 5 categories: business, entertainment, politics, sport or tech. As an ML engineer, you are required to use a public dataset from the BBC in which each article is labelled with one of these 5 categories.

**The goal will be to build a system that can accurately classify previously unseen news articles into the right category.**

The evaluation metric to use is **accuracy**.
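
Accuracy is the fraction of articles whose predicted category matches the true one. A minimal sketch in plain Python follows, using made-up labels for illustration. The `accuracy` helper is hypothetical; the notebook itself relies on `accuracy_score` from scikit-learn.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of an empty sample")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Illustrative labels only, not taken from the dataset.
toy_true = ["sport", "tech", "business", "politics"]
toy_pred = ["sport", "tech", "business", "tech"]
toy_accuracy = accuracy(toy_true, toy_pred)
print(toy_accuracy)  # 3 of 4 correct -> 0.75
```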

## Import Libraries

In [None]:
from google.colab import files
uploaded = files.upload()

Saving BBC News.csv to BBC News.csv


In [None]:
import numpy as np
import pandas as pd

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

**spaCy model**

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [None]:
df = pd.read_csv('/content/BBC News.csv')
df.head()

Unnamed: 0,ArticleId,Text,Category
0,1833,worldcom ex-boss launches defence lawyers defe...,business
1,154,german business confidence slides german busin...,business
2,1101,bbc poll indicates economic gloom citizens in ...,business
3,1976,lifestyle governs mobile choice faster bett...,tech
4,917,enron bosses in $168m payout eighteen former e...,business


## Data Understanding

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1490 entries, 0 to 1489
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   ArticleId  1490 non-null   int64 
 1   Text       1490 non-null   object
 2   Category   1490 non-null   object
dtypes: int64(1), object(2)
memory usage: 35.0+ KB


In [None]:
df.Category.value_counts()

sport            346
business         336
politics         274
entertainment    273
tech             261
Name: Category, dtype: int64
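
Since accuracy is the metric, the class counts give a useful floor: the score of a model that always predicts the most frequent category. The sketch below does that arithmetic using only the counts printed above. The variable names are illustrative, and the figure is computed on the full dataset, so the value on a held-out split may differ slightly.

```python
# Category counts copied from the value_counts() output.
category_counts = {
    "sport": 346,
    "business": 336,
    "politics": 274,
    "entertainment": 273,
    "tech": 261,
}

total = sum(category_counts.values())
majority_class = max(category_counts, key=category_counts.get)
# Accuracy of always predicting the most frequent category.
baseline_accuracy = category_counts[majority_class] / total

print(total, majority_class, round(baseline_accuracy, 3))  # 1490 sport 0.232
```

Any classifier worth keeping should score well above this baseline.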

In [None]:
df['ArticleId'].nunique()

1490

In [None]:
df.drop('ArticleId', axis=1, inplace=True)

### Preprocessing using spaCy

In [None]:
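def is_whitespace(data):
    # Return the index labels of rows whose Text is only whitespace.
    # Iterating over the Text column directly keeps this independent
    # of how many other columns the DataFrame has.
    blank = []
    for idx, text in data['Text'].items():
        if text.isspace():
            blank.append(idx)

    return blank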

In [None]:
is_whitespace(df)

[]

In [None]:
df.Text[0]



In [None]:
def preprocess(text):
    # remove stop words and lemmatize the text
    doc = nlp(text)
    filtered_tokens = []
    for token in doc:
        if token.is_stop or token.is_punct or token.is_space:
            continue
        filtered_tokens.append(token.lemma_)
    
    return " ".join(filtered_tokens) 

**Remove stop words and punctuation from the text, then lemmatize the remaining tokens.**

In [None]:
df['processed_text'] = df['Text'].apply(preprocess)

In [None]:
df.head()

Unnamed: 0,Text,Category,processed_text
0,worldcom ex-boss launches defence lawyers defe...,business,worldcom ex boss launch defence lawyer defend ...
1,german business confidence slides german busin...,business,german business confidence slide german busine...
2,bbc poll indicates economic gloom citizens in ...,business,bbc poll indicate economic gloom citizen major...
3,lifestyle governs mobile choice faster bett...,tech,lifestyle govern mobile choice fast well funky...
4,enron bosses in $168m payout eighteen former e...,business,enron boss $ 168 m payout eighteen enron direc...


In [None]:
df.Text[0]



In [None]:
df.processed_text[0]



## Label Encoding and Data Splitting

In [None]:
le = LabelEncoder()
cat_fit = le.fit(df.Category)
y = cat_fit.transform(df.Category)
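
`LabelEncoder` assigns integer codes to the classes in sorted order. The sketch below reproduces that mapping in plain Python for the five categories, so that numeric predictions can be read back as category names. It illustrates the behaviour only; the dictionary names are hypothetical.

```python
categories = ["business", "entertainment", "politics", "sport", "tech"]

# Sorted unique labels get consecutive integer codes starting at 0.
label_to_code = {label: code for code, label in enumerate(sorted(set(categories)))}
code_to_label = {code: label for label, code in label_to_code.items()}

print(label_to_code)
```

In the notebook, `le.classes_` holds the same ordering and `le.inverse_transform(...)` converts predicted codes back to category names.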

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df.processed_text, y, 
                                                    test_size=0.2, random_state=42)

In [None]:
X_train.shape, X_test.shape

((1192,), (298,))

In [None]:
y_train.shape, y_test.shape

((1192,), (298,))

## Data Modelling 

> We use sklearn pipelines to perform preprocessing and modelling in sequence.

### 1. Using CountVectorizer


In [None]:
model1 = Pipeline([('c_vectorizer', CountVectorizer(ngram_range=(1, 2))), 
                      ('bayes_model', MultinomialNB())])
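
With `ngram_range=(1, 2)` the vectorizer counts both single words and pairs of adjacent words. A rough plain-Python sketch of that feature extraction follows. It mimics the default lowercasing and the default token pattern (tokens of two or more word characters) but leaves out vocabulary building and sparse matrices. The `ngram_counts` helper and the example sentence are hypothetical.

```python
import re
from collections import Counter


def ngram_counts(text, ngram_range=(1, 2)):
    """Count word n-grams for every n in the inclusive ngram_range."""
    # Lowercase and keep tokens of at least two word characters.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    low, high = ngram_range
    counts = Counter()
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Illustrative sentence, not taken from the dataset.
example = ngram_counts("stock market falls as stock market fears grow")
print(example["stock market"])  # the bigram occurs twice -> 2
```

Bigrams let the model pick up short phrases at the cost of a much larger feature space.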

In [None]:
model1.fit(X_train, y_train)

Pipeline(steps=[('c_vectorizer', CountVectorizer(ngram_range=(1, 2))),
                ('bayes_model', MultinomialNB())])

In [None]:
model1_pred = model1.predict(X_test)

In [None]:
print(f'Accuracy score of count vectorizer based model: {accuracy_score(y_test, model1_pred):.2f}')

Accuracy score of count vectorizer based model: 0.98


### 2. Using TF-IDF Vectorizer

In [None]:
model2 = Pipeline([('t_vector', TfidfVectorizer()), 
                    ('bayes_model_2', MultinomialNB())])
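
TF-IDF scales each term count by how rare the term is across documents. The sketch below follows the weighting that `TfidfVectorizer` applies with its default settings: raw term counts, a smoothed idf of `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each document vector. It assumes whitespace tokenization and non-empty documents, and the `tfidf_vectors` helper and toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    vectors = []
    for tokens in tokenized:
        weights = {t: count * idf[t] for t, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


# Illustrative documents, not taken from the dataset.
toy_docs = ["market shares fall", "market rally", "team wins match"]
toy_vectors = tfidf_vectors(toy_docs)
# "market" appears in two documents and "rally" in one, so "rally" weighs more.
print(toy_vectors[1])
```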

In [None]:
model2.fit(X_train, y_train)

Pipeline(steps=[('t_vector', TfidfVectorizer()),
                ('bayes_model_2', MultinomialNB())])

In [None]:
model2_pred = model2.predict(X_test)

In [None]:
print(f'Accuracy score of tfidf based model: {accuracy_score(y_test, model2_pred):.2f}')

Accuracy score of tfidf based model: 0.96


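**Inferences**:
- The CountVectorizer pipeline (unigrams and bigrams) reaches 98% accuracy on the 298-article test set.
- The TfidfVectorizer pipeline (default settings, unigrams only) reaches 96% accuracy on the same split.
- The two pipelines differ in both weighting scheme and n-gram range, so the 2-point gap (roughly 6 articles) cannot be attributed to the weighting scheme alone.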
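
To judge how much weight the 2-point gap can bear, a rough binomial standard error can be computed for each accuracy on the 298-article test set. This is a back-of-the-envelope sketch that treats test articles as independent draws and uses the rounded accuracies printed above; it is not part of the original analysis, and the `accuracy_std_error` helper is hypothetical.

```python
import math

n_test = 298  # size of the test split reported above


def accuracy_std_error(acc, n):
    """Binomial standard error of an accuracy estimated from n samples."""
    return math.sqrt(acc * (1 - acc) / n)


for name, acc in [("count", 0.98), ("tfidf", 0.96)]:
    se = accuracy_std_error(acc, n_test)
    print(f"{name}: {acc:.2f} +/- {se:.3f} (about {round(acc * n_test)} of {n_test} correct)")
```

Each standard error is around one percentage point, so a single train/test split gives only weak evidence that one vectorizer is better than the other.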