# Importing packages

Importing the necessary packages for the project.

In [1]:
import numpy as np
import pandas as pd 
import csv
import seaborn as sns
import matplotlib.pyplot as plt

# Displays output inline
%matplotlib inline

# Suppress warning messages to keep the output readable
import warnings
warnings.filterwarnings('ignore')

# Loading Data

Loading the dataset from the 'train.csv' file.

In [2]:
df = pd.read_csv("train.csv", index_col=False)

# Data Cleaning

In this section, we perform various data cleaning tasks to prepare the dataset for further analysis and modeling.


In [3]:
df.head(10)

Unnamed: 0,headlines,description,content,url,category
0,RBI revises definition of politically-exposed ...,The central bank has also asked chairpersons a...,The Reserve Bank of India (RBI) has changed th...,https://indianexpress.com/article/business/ban...,business
1,NDTV Q2 net profit falls 57.4% to Rs 5.55 cror...,NDTV's consolidated revenue from operations wa...,Broadcaster New Delhi Television Ltd on Monday...,https://indianexpress.com/article/business/com...,business
2,"Akasa Air ‘well capitalised’, can grow much fa...",The initial share sale will be open for public...,Homegrown server maker Netweb Technologies Ind...,https://indianexpress.com/article/business/mar...,business
3,India’s current account deficit declines sharp...,The current account deficit (CAD) was 3.8 per ...,India’s current account deficit declined sharp...,https://indianexpress.com/article/business/eco...,business
4,"States borrowing cost soars to 7.68%, highest ...",The prices shot up reflecting the overall high...,States have been forced to pay through their n...,https://indianexpress.com/article/business/eco...,business
5,"India’s Russian oil imports slip in Oct, Saudi...",Russian crude accounted for nearly 35 per cent...,India’s oil imports from Russia averaged 1.57 ...,https://indianexpress.com/article/business/com...,business
6,Neelkanth Mishra appointed part-time chairpers...,The board of the UIDAI comprises a chairperson...,"Neelkanth Mishra, chief economist at Axis Bank...",https://indianexpress.com/article/business/eco...,business
7,Centre issues advisory to social media platfor...,The IT ministry had earlier also issued adviso...,The Ministry of Electronics and IT (MeitY) has...,https://indianexpress.com/article/business/cen...,business
8,Asian shares rise after eased pressure on bond...,US futures were little changed and oil prices ...,"Shares advanced Wednesday in Asia, tracking Wa...",https://indianexpress.com/article/business/wor...,business
9,India’s demand for electricity for ACs to exce...,India's demand for electricity for running hou...,nrIndia’s demand for electricity for running h...,https://indianexpress.com/article/business/eco...,business


# Dataset Summary

The dataset contains news articles from various domains, including Business, Technology, Sports, Education, and Entertainment. Each row in the dataset represents a single news article and consists of the following columns:

- **headlines**: The headline or title of the news article.
- **description**: A brief description or summary of the news article's content.
- **content**: The main body or text of the news article.
- **url**: The URL or web address of the news article.
- **category**: The category or domain to which the news article belongs (e.g., Business, Technology, Sports, etc.).

This dataset will be used for training a machine learning model to classify news articles into their respective categories based on the provided features (headlines, description, content, and URL).
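
Because the goal is classification, it can help to check how many articles fall into each category before modeling. The following is a minimal sketch on a small made-up frame; `toy_df` and its rows are illustrative and are not taken from `train.csv`.

```python
import pandas as pd

# Toy stand-in for the real dataset; the rows are invented for illustration.
toy_df = pd.DataFrame({
    "headlines": ["a", "b", "c", "d"],
    "category": ["business", "sports", "business", "education"],
})

# Absolute and relative frequency of each class label.
counts = toy_df["category"].value_counts()
shares = toy_df["category"].value_counts(normalize=True)

print(counts)
print(shares.round(2))
```

On the real data, the same two calls on `df["category"]` would show whether the classes are roughly balanced.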

## Dataset Size

The dataset contains 5,520 rows (news articles) and 5 columns:

In [4]:
df.shape

(5520, 5)

The shape confirms 5,520 instances and 5 columns: four text fields (headlines, description, content, url) and the target label (category).

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5520 entries, 0 to 5519
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   headlines    5520 non-null   object
 1   description  5520 non-null   object
 2   content      5520 non-null   object
 3   url          5520 non-null   object
 4   category     5520 non-null   object
dtypes: object(5)
memory usage: 215.8+ KB


## Data Information

The DataFrame contains the following information:

- It is a pandas DataFrame object.
- The DataFrame has 5,520 rows (from index 0 to 5,519).
- There are 5 columns in the DataFrame.
- All columns are non-null, meaning there are no missing values.
- All columns have the data type 'object', which typically represents string data.
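- The reported memory usage is at least 215.8 KB. The `+` in the output means the contents of the `object` columns are not fully counted, so the true footprint is larger.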
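
To see how much the shallow figure understates the footprint of text columns, pandas can measure the strings themselves. This is a small sketch on an invented frame (`toy_df` is hypothetical); on the real data the equivalent call would be `df.info(memory_usage="deep")`.

```python
import pandas as pd

# Invented rows; the second headline is long on purpose.
toy_df = pd.DataFrame({
    "headlines": ["short", "a much longer headline " * 10],
    "category": ["business", "sports"],
})

# Shallow usage does not fully account for the string contents.
shallow = toy_df.memory_usage(deep=False).sum()
# Deep usage also measures the string objects themselves.
deep = toy_df.memory_usage(deep=True).sum()

print(f"shallow: {shallow} bytes, deep: {deep} bytes")
```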

In [6]:
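# Imports used by the methods below (they are repeated in the NLP setup cell)
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

class DataFrameProcessor: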
    def __init__(self, df):
        self.df = df

    def check_duplicates(self):
        """
        Check for duplicate rows in the DataFrame and print the result.
        """
        duplicate_rows = self.df[self.df.duplicated()]

        if duplicate_rows.empty:
            print("No duplicates found.")
        else:
            print("Duplicates found!")
            print(duplicate_rows)

    def check_missing_values(self):
        """
        Check for missing values in the DataFrame and print the count of missing values in each column.
        """
        missing_values_count = self.df.isnull().sum()
        print("Missing values count per column:")
        print(missing_values_count)

    def convert_columns_to_lowercase(self, columns):
        """
        Convert specified columns to lowercase.
        """
        for column in columns:
            self.df[column] = self.df[column].str.lower()

    def remove_punctuation(self, columns):
        """
        Remove punctuation from specified columns.
        """
        for column in columns:
            self.df[column] = self.df[column].str.replace(f"[{string.punctuation}]", " ", regex=True)

    def clean_whitespace(self, columns):
        """
        Trim whitespace and replace multiple spaces with a single space in specified columns.
        """
        for column in columns:
            self.df[column] = self.df[column].str.strip()
            # Raw string so that "\s" is passed to the regex engine unchanged
            self.df[column] = self.df[column].str.replace(r"\s+", " ", regex=True)

    def remove_stopwords(self, columns):
        """
        Remove stopwords from specified columns.
        """
        stop_words = set(stopwords.words('english'))
        
        for column in columns:
            self.df[column] = self.df[column].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))

    def tokenize_columns(self, columns):
        """
        Tokenize text in specified columns.
        """
        for column in columns:
            self.df[column] = self.df[column].apply(word_tokenize)

    def stem_columns(self, columns):
        """
        Stem words in specified columns using PorterStemmer.
        """
        stemmer = PorterStemmer()
        
        for column in columns:
            self.df[column] = self.df[column].apply(lambda x: [stemmer.stem(word) for word in x])

In [7]:
processor = DataFrameProcessor(df)

## Duplicate Rows Check

The following code checks for the presence of duplicate rows in the DataFrame:

In [8]:
processor.check_duplicates()

No duplicates found.


**No Duplicates Found**

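`DataFrame.duplicated()` flags no rows, so no two rows are identical across all columns. Rows that share only some values, such as the same URL with a different headline, would not be detected by this check.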
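
If duplicates should be judged on a single field instead of the whole row, `duplicated` accepts a `subset` argument. The sketch below uses invented rows (`toy_df` and the `example.com` URLs are hypothetical) to show the difference.

```python
import pandas as pd

# Two rows share a URL but differ in their headline.
toy_df = pd.DataFrame({
    "headlines": ["Rates held", "Rates held steady", "Team wins"],
    "url": [
        "https://example.com/a",
        "https://example.com/a",
        "https://example.com/b",
    ],
})

# Whole-row comparison: no row is an exact copy of another.
full_row_dupes = toy_df.duplicated().sum()
# Comparison on the url column only: the second row repeats the first URL.
url_dupes = toy_df.duplicated(subset=["url"]).sum()

print(full_row_dupes, url_dupes)
```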


## Missing Values Check

Missing values can bias or invalidate later analysis if left unhandled, so we check for them before any text processing. The call below prints the number of missing values in each column of the DataFrame.



In [9]:
processor.check_missing_values()

Missing values count per column:
headlines      0
description    0
content        0
url            0
category       0
dtype: int64


## Summary of Missing Values Count Per Column

`isnull().sum()` reports zero missing values in every column:

- **headlines**: 0
- **description**: 0
- **content**: 0
- **url**: 0
- **category**: 0

These results indicate that no imputation or row removal is needed for missing values before further analysis.
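
Note that `isnull()` only detects `NaN`/`None`. Text columns can also contain empty or whitespace-only strings, which behave like missing data for modeling. A minimal sketch of that extra check, on invented rows (`toy_df` is hypothetical):

```python
import pandas as pd

toy_df = pd.DataFrame({
    "headlines": ["Rates held", "   ", ""],
    "content": ["text", "more text", "even more"],
})

# isnull() finds nothing here ...
null_counts = toy_df.isnull().sum()

# ... but two headlines are empty once surrounding whitespace is stripped.
blank_counts = toy_df.apply(lambda col: col.str.strip().eq("").sum())

print(null_counts)
print(blank_counts)
```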


## Preparing Your Environment for NLP Tasks

Before diving into NLP tasks with NLTK, it's crucial to set up your environment correctly. This involves installing the required libraries and downloading the necessary NLTK corpora and tokenizers. Follow these steps to ensure your setup is ready:

1. **Import Necessary Libraries**

First, you'll need to import the libraries that will be used for various NLP tasks. This includes standard Python libraries for string manipulation, as well as specific NLTK modules for tokenization, stop words removal, and stemming/lemmatization. Execute the following code to import these libraries:



In [10]:
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer



2. **Download NLTK Data**

Some NLTK features depend on data packages that are not bundled with the library and must be downloaded once before first use. The commands below fetch the Punkt tokenizer models, the stop word lists, and WordNet. Depending on the installed NLTK version, `word_tokenize` may additionally require the `punkt_tab` resource.



In [11]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ismae\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ismae\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ismae\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Lowercasing Text Columns in DataFrame

To standardize the case of text data across different columns in a DataFrame, it's often beneficial to convert all text to lowercase. This ensures consistency in text comparisons and analyses. Below is the code to lowercase the text in specific columns of a DataFrame:



In [12]:
processor.convert_columns_to_lowercase(['headlines', 'description', 'content', 'url', 'category'])


This code snippet iterates over the specified columns ('headlines', 'description', 'content', 'url', and 'category') in the DataFrame `df` and applies the `.str.lower()` method to each, converting all text to lowercase.


## Removing Punctuation from Text Columns in DataFrame

To enhance the cleanliness and uniformity of text data, it's common to remove punctuation marks. This can simplify text analysis and improve the performance of certain algorithms. Below is the code to remove punctuation from specific columns of a DataFrame:



In [13]:
processor.remove_punctuation(['headlines', 'description', 'content', 'url', 'category'])


This code snippet iterates over the specified columns ('headlines', 'description', 'content', 'url', and 'category') in the DataFrame `df` and applies the `.str.replace()` method with a regular expression that replaces every ASCII punctuation character in `string.punctuation` with a space. Non-ASCII marks such as the curly quotes ‘ and ’ are not covered and remain in the text.
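
Two details of the punctuation pattern are worth illustrating. First, wrapping `string.punctuation` in `re.escape` makes every character literal inside the character class, so the pattern does not depend on how `]`, `\`, `^`, and `-` happen to be ordered. Second, curly quotes are not part of `string.punctuation`. The sketch below is one possible variant, not the notebook's method; the extended character list is an assumption about which extra marks should be removed.

```python
import re
import string

import pandas as pd

headlines = pd.Series([
    "Akasa Air ‘well capitalised’, can grow!",
    "C++ [beta] 50%-off",
])

# re.escape makes every ASCII punctuation character literal inside the class.
ascii_punct = f"[{re.escape(string.punctuation)}]"
ascii_only = headlines.str.replace(ascii_punct, " ", regex=True)

# Assumption: curly quotes should be stripped too; string.punctuation lacks them.
extended_punct = f"[{re.escape(string.punctuation)}‘’“”]"
extended = headlines.str.replace(extended_punct, " ", regex=True)

print(ascii_only.tolist())
print(extended.tolist())
```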


## Cleaning Text Columns in DataFrame by Stripping Whitespace and Replacing Multiple Spaces

To further refine the text data in a DataFrame, it's important to remove leading and trailing whitespace and consolidate multiple consecutive spaces into single spaces. This enhances the readability and consistency of the text data. Below is the code to achieve this for specific columns of a DataFrame:



In [14]:
processor.clean_whitespace(['headlines', 'description', 'content', 'url', 'category'])


This code snippet iterates over the specified columns ('headlines', 'description', 'content', 'url', and 'category') in the DataFrame `df`. First, it uses `.str.strip()` to remove leading and trailing whitespace from each text entry. Then, it employs `.str.replace("\s+", " ", regex=True)` to replace sequences of one or more whitespace characters (including spaces, tabs, and newlines) with a single space. This process cleanses the text data by normalizing whitespace usage, making the text cleaner and more uniform.
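
The effect of the two whitespace steps can be seen on a single invented string. This sketch writes the pattern as a raw string (`r"\s+"`), which keeps Python from interpreting `\s` as a string escape.

```python
import pandas as pd

text = pd.Series(["  rbi   revises\tdefinition \n of kyc  "])

# Collapse every run of whitespace to one space, then trim both ends.
cleaned = text.str.replace(r"\s+", " ", regex=True).str.strip()

print(cleaned[0])  # rbi revises definition of kyc
```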


## Removing Stop Words from Text Columns in DataFrame

Stop words are very common words (such as "the", "is", and "of") that carry little topical meaning on their own. For tasks like topic classification they can usually be removed to reduce noise in the data. Below is the code to remove English stop words from specific columns of a DataFrame:



In [15]:
processor.remove_stopwords(['headlines', 'description', 'content', 'url', 'category'])


This code snippet iterates over the specified columns ('headlines', 'description', 'content', 'url', and 'category') in the DataFrame `df`. For each column, it applies a lambda function that splits the text into individual words, filters out stop words, and then joins the remaining words back together with spaces. This process removes stop words from the text data, thereby enhancing the focus on more meaningful words and phrases.
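
The filtering step can be sketched without NLTK data by using a tiny hand-written stop word set. The set below is a hypothetical stand-in for `stopwords.words('english')`, and the sentence is invented.

```python
import pandas as pd

# Hypothetical mini stop word list; the notebook uses NLTK's English list.
stop_words = {"the", "has", "of", "a", "to"}

text = pd.Series(["the central bank has changed the definition of kyc"])


def drop_stop_words(sentence):
    """Keep only the words that are not in the stop word set."""
    return " ".join(word for word in sentence.split() if word not in stop_words)


filtered = text.apply(drop_stop_words)

print(filtered[0])  # central bank changed definition kyc
```

Storing the stop words in a `set` keeps each membership test cheap, which matters when the lambda runs over every word of every article.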


## Tokenizing Text Columns in DataFrame Using NLTK

Tokenization is the process of breaking down text into individual words or tokens, which is a fundamental step in many natural language processing (NLP) tasks. Below is the code to tokenize the text in specific columns of a DataFrame using NLTK's `word_tokenize` function:



In [16]:
processor.tokenize_columns(['headlines', 'description', 'content', 'url', 'category'])


This code snippet iterates over the specified columns ('headlines', 'description', 'content', 'url', and 'category') in the DataFrame `df`. It applies the `word_tokenize` function from NLTK to each column, effectively splitting the text into individual words or tokens. This process prepares the text data for further NLP tasks such as filtering, stemming, or building frequency distributions.
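
After tokenization each cell holds a list of tokens instead of a single string. The sketch below shows that change of shape on invented rows. It uses plain whitespace splitting as a stand-in so that it runs without the Punkt data; this is not equivalent to `word_tokenize` in general, although the results are similar for text that has already been lowercased and stripped of punctuation.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "headlines": ["rbi revises definition", "ndtv q2 net profit falls"],
})

# Stand-in tokenizer: whitespace splitting, NOT NLTK's word_tokenize.
toy_df["headlines"] = toy_df["headlines"].apply(str.split)

print(toy_df["headlines"].tolist())
```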


## Stemming Text Columns in DataFrame Using NLTK

Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form. This technique is useful in natural language processing (NLP) for simplifying words and reducing the complexity of text data. Below is the code to apply stemming to the text in specific columns of a DataFrame using NLTK's `PorterStemmer`:



In [17]:
processor.stem_columns(['headlines', 'description', 'content', 'url', 'category'])


This code snippet iterates over the specified columns ('headlines', 'description', 'content', 'url', and 'category') in the DataFrame `df`. For each column, it applies a lambda function that stems each word in the list of tokens using the `PorterStemmer` instance. The resulting stems are not always dictionary words: in the output below, "business" becomes `busi` and "technology" becomes `technolog`. Collapsing inflected forms in this way shrinks the vocabulary, which can help NLP tasks such as text classification or clustering.
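
Because stemming also rewrites the values of whatever column it is applied to, one option is to stem only the free-text columns and leave the label untouched. The sketch below assumes that `category` is meant to stay readable as the class label; that is an assumption about intent, not something the notebook states. `toy_df` is an invented one-row frame.

```python
import pandas as pd
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

toy_df = pd.DataFrame({
    "headlines": [["rbi", "revises", "definition"]],
    "category": ["business"],
})

# Stem only the tokenized text column; the label column is left as-is.
toy_df["headlines"] = toy_df["headlines"].apply(
    lambda tokens: [stemmer.stem(token) for token in tokens]
)

print(toy_df)
```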


In [18]:
processor.df

Unnamed: 0,headlines,description,content,url,category
0,"[rbi, revis, definit, polit, expos, person, ky...","[central, bank, also, ask, chairperson, chief,...","[reserv, bank, india, rbi, chang, definit, pol...","[http, indianexpress, com, articl, busi, bank,...",[busi]
1,"[ndtv, q2, net, profit, fall, 57, 4, rs, 5, 55...","[ndtv, consolid, revenu, oper, rs, 95, 55, cro...","[broadcast, new, delhi, televis, ltd, monday, ...","[http, indianexpress, com, articl, busi, compa...",[busi]
2,"[akasa, air, ‘, well, capitalis, ’, grow, much...","[initi, share, sale, open, public, subscript, ...","[homegrown, server, maker, netweb, technolog, ...","[http, indianexpress, com, articl, busi, marke...",[busi]
3,"[india, ’, s, current, account, deficit, decli...","[current, account, deficit, cad, 3, 8, per, ce...","[india, ’, s, current, account, deficit, decli...","[http, indianexpress, com, articl, busi, econo...",[busi]
4,"[state, borrow, cost, soar, 7, 68, highest, fa...","[price, shot, reflect, overal, higher, risk, a...","[state, forc, pay, nose, weekli, auction, debt...","[http, indianexpress, com, articl, busi, econo...",[busi]
...,...,...,...,...,...
5515,"[samsung, send, invit, ‘, unpack, 2024, ’, new...","[samsung, like, announc, next, gener, galaxi, ...","[samsung, plan, reveal, next, gener, flagship,...","[http, indianexpress, com, articl, technolog, ...",[technolog]
5516,"[googl, pixel, 8, pro, accident, appear, offic...","[pixel, 8, pro, like, carri, predecessor, desi...","[googl, accident, gave, us, glimps, upcom, fla...","[http, indianexpress, com, articl, technolog, ...",[technolog]
5517,"[amazon, ad, googl, search, redirect, user, mi...","[click, real, look, amazon, ad, open, page, su...","[new, scam, seem, make, round, internet, legit...","[http, indianexpress, com, articl, technolog, ...",[technolog]
5518,"[elon, musk, ’, s, x, previous, twitter, worth...","[elon, musk, x, formerli, twitter, lost, half,...","[year, elon, musk, acquir, twitter, 44, billio...","[http, indianexpress, com, articl, technolog, ...",[technolog]


# Creating a pipeline