Project workflow:
1. Requirement analysis and problem understanding
2. Data collection using web scraping
3. Data preparation and exploratory data analysis
4. Data processing (encoding, decoding, feature scaling)
5. Model building
6. Predictive system

# Requirement/problem Analysis



1.   Choose a public news website.
2.   Write a script that scrapes about 100 news articles from the website.
3.   Tag each article using the classification (section) provided by the website.
4.   Train a text classification model of your choice and test whether it classifies the articles correctly.
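
Requirement 3 asks for each article to carry the website's own category. One possible way to derive that tag, sketched below, is to read it from the article URL. This assumes the site places the section name in the URL path right after the date; `section_from_url` and the example URL are hypothetical and not part of the notebook.

```python
from urllib.parse import urlparse

def section_from_url(url):
    """Return the first non-numeric path segment of a URL, or None if there is none."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    for segment in segments:
        if not segment.isdigit():  # skip year/month/day segments such as 2024/01/15
            return segment
    return None

# Hypothetical article URL, used for illustration only
example = "https://www.example.com/2024/01/15/health/some-article.html"
print(section_from_url(example))
```

A site that does not expose the section in the URL would need a different approach, such as reading a section label from the article page itself.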

In [1]:
class RequirementAnalysis:
    def __init__(self, project_name, stakeholders, objectives, constraints):
        self.project_name = project_name
        self.stakeholders = stakeholders
        self.objectives = objectives
        self.constraints = constraints

    def display_requirements(self):
        print("Project Name:", self.project_name)
        print("\nStakeholders:")
        for stakeholder in self.stakeholders:
            print("-", stakeholder)

        print("\nObjectives:")
        for objective in self.objectives:
            print("-", objective)

        print("\nConstraints:")
        for constraint in self.constraints:
            print("-", constraint)


# Example Usage
if __name__ == "__main__":
    # Define project requirements
    project_name = "News-Classification using text classification algorithm"
    stakeholders = ["NonStop io Company", "PCET Placement Cell"]
    objectives = [
         """Implement a text classification model that accurately assigns news articles
  to predefined categories, ensuring high precision and recall rates to minimize misclassifications."""
    ]
    constraints = ["Accuracy score or error rate", "Timeline limitations"]

    # Create an instance of RequirementAnalysis
    requirements = RequirementAnalysis(
        project_name, stakeholders, objectives, constraints
    )

    # Display the requirements
    requirements.display_requirements()


Project Name: News-Classification using text classification algorithm

Stakeholders:
- NonStop io Company
- PCET Placement Cell

Objectives:
- Implement a text classification model that accurately assigns news articles
  to predefined categories, ensuring high precision and recall rates to minimize misclassifications.

Constraints:
- Accuracy score or error rate
- Timeline limitations


# Data Collection

1. Data source: The New York Times
2. Tool used: BeautifulSoup (Python library)
3. Output file: news_data.csv

In [2]:
import requests
from bs4 import BeautifulSoup
import csv

def scrape_website(url):
    # Make the request to the website (time out instead of waiting indefinitely)
    response = requests.get(url, timeout=10)

    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Parse the HTML content of the page
        soup = BeautifulSoup(response.text, 'html.parser')

        # Extract every anchor (<a>) tag on the page
        links = soup.find_all('a')

        # Create a list to store the scraped data
        data = []

        for link in links:
            data.append({
                'Link Text': link.text,
                'URL': link.get('href', '')
            })

        return data
    else:
        print(f"Failed to retrieve the page. Status code: {response.status_code}")
        return None

def save_to_csv(data, file_name='scraped_data.csv'):
    # Save the data to a CSV file
    with open(file_name, 'w', newline='', encoding='utf-8') as csv_file:
        fieldnames = ['Link Text', 'URL']
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames)

        # Write the header
        writer.writeheader()

        # Write the data
        writer.writerows(data)

if __name__ == "__main__":
    # Specify the URL of the website you want to scrape
    website_url = 'https://www.nytimes.com/international/'

    # Scrape data from the website
    scraped_data = scrape_website(website_url)

    if scraped_data:
        # Save the data to a CSV file
        save_to_csv(scraped_data, 'news_data.csv')
        print("The file created successfully")
    else:
        print("No data scraped: the request failed or the page contained no links")


The file created successfully
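
To make the extraction step concrete, the sketch below runs the same `find_all('a')` and `get('href', '')` calls on a small inline HTML string instead of a live page. The HTML snippet is made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML, for illustration only
html = """
<a href="/section/health">Health</a>
<a href="#site-index">Skip to site index</a>
<a>No href here</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Same fields as the scraper above: visible text and the href attribute
rows = [
    {"Link Text": a.get_text(strip=True), "URL": a.get("href", "")}
    for a in soup.find_all("a")
]
for row in rows:
    print(row)
```

An anchor without an `href` produces an empty string here. Since `pandas.read_csv` reads empty fields as NaN by default, empty link texts and hrefs are a plausible source of null values once the CSV is loaded.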


# Data Preparation

In [3]:
#importing the created dataset
import pandas as pd
data = pd.read_csv("/content/news_data.csv")

In [4]:
#checking data size
data.shape

(649, 2)

In [5]:
#checking data top and bottom samples
data.head(10)
#data.tail()

Unnamed: 0,Link Text,URL
0,Skip to content,#site-content
1,Skip to site index,#site-index
2,SKIP ADVERTISEMENT,#after-dfp-ad-top
3,Skip to content,#site-content
4,Skip to site index,#site-index
5,,/
6,U.S.,/
7,International,/international/
8,Canada,/ca/
9,Español,https://www.nytimes.com/es/


In [6]:
#checking the null values
data.isnull().sum()

Link Text    4
URL          4
dtype: int64

In [7]:
#removing the null values
data.dropna(inplace=True)
data.isnull().sum()

Link Text    0
URL          0
dtype: int64

In [8]:
#shape of updated dataset
data.shape

(641, 2)

In [9]:
#summarizing column dtypes and non-null counts after removing the null values
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 641 entries, 0 to 648
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Link Text  641 non-null    object
 1   URL        641 non-null    object
dtypes: object(2)
memory usage: 15.0+ KB


In [10]:
# checking for duplicate rows (duplicated() marks every repeat after the first occurrence)
count = data.duplicated().sum()
print("total duplicated values: ",count)
print("shape of dataset: ",data.shape)

total duplicated values:  300
shape of dataset:  (641, 2)


In [11]:
#removing the duplicated rows to avoid bias in training and testing
# drop_duplicates(inplace=True) returns None, so the removed count is taken from the row counts
rows_before = len(data)
data.drop_duplicates(inplace=True)
print("total duplicated values removed: ",rows_before - len(data))
print("current shape of dataset: ",data.shape)

total duplicated values removed:  300
current shape of dataset:  (341, 2)


In [12]:
# inspecting the data: the URL column contains values that are not absolute URLs (e.g. '#site-content', '/'), which must be removed
data.head()

Unnamed: 0,Link Text,URL
0,Skip to content,#site-content
1,Skip to site index,#site-index
2,SKIP ADVERTISEMENT,#after-dfp-ad-top
6,U.S.,/
7,International,/international/


In [13]:
import re

# Absolute http(s) URLs only; fragments ('#site-index') and relative paths ('/ca/') do not match
URL_PATTERN = re.compile(r'https?://\S+')

# Function to filter out values that are not absolute URLs
def clean_url(url):
    return url if URL_PATTERN.match(url) else None

# Clean the 'URL' column
data['URL'] = data['URL'].apply(clean_url)

# Drop rows with non-URLs
data.dropna(subset=['URL'],inplace=True)

In [14]:
data.head()

Unnamed: 0,Link Text,URL
9,Español,https://www.nytimes.com/es/
10,中文,https://cn.nytimes.com
11,Today’s Paper,https://www.nytimes.com/section/todayspaper
13,U.S.,https://www.nytimes.com/international/section/us
15,Politics,https://www.nytimes.com/international/section/...
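
The cleaning step above discards relative paths such as `/ca/`. An alternative, sketched here as an assumption and not what the notebook does, is to resolve them against the scraped page URL with `urllib.parse.urljoin` so those rows are kept. `to_absolute` is a hypothetical helper.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.nytimes.com/international/"  # the page scraped earlier

def to_absolute(href):
    """Resolve a relative href against BASE_URL; return None for in-page fragments."""
    if href.startswith("#"):
        return None
    return urljoin(BASE_URL, href)

for href in ["/ca/", "#site-index", "https://www.nytimes.com/es/"]:
    print(href, "->", to_absolute(href))
```

Resolving links this way keeps more rows, but it can also reintroduce duplicates when the same page was linked both relatively and absolutely, so deduplication would have to run afterwards.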


In [15]:
# removing the unwanted data from the link text column

# Function to filter out unwanted text
def clean_link_text(text):
    # Keep the text only if it consists entirely of ASCII letters and whitespace
    english_letters_pattern = re.compile(r'^[a-zA-Z\s]+$')
    return text if english_letters_pattern.match(text) else None

# Clean the 'Link Text' column
data['Link Text'] = data['Link Text'].apply(clean_link_text)

# Drop rows with unwanted text
data.dropna(subset=['Link Text'],inplace=True)

In [16]:
data.head(5)

Unnamed: 0,Link Text,URL
15,Politics,https://www.nytimes.com/international/section/...
16,New York,https://www.nytimes.com/international/section/...
17,California,https://www.nytimes.com/column/california-today
18,Education,https://www.nytimes.com/international/section/...
19,Health,https://www.nytimes.com/international/section/...
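
The effect of the link-text filter can be checked on a few values taken from the earlier outputs. The sketch below applies the same regular expression to them and shows that punctuation and non-ASCII letters cause a value to be dropped.

```python
import re

# Same pattern as clean_link_text: ASCII letters and whitespace only
pattern = re.compile(r'^[a-zA-Z\s]+$')

samples = ["Politics", "New York", "U.S.", "Español", "中文", "Today’s Paper"]
kept = [s for s in samples if pattern.match(s)]
print(kept)  # ['Politics', 'New York']
```

Labels such as "U.S." are valid section names, so a stricter project might widen the pattern to allow periods and apostrophes.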


# Data Processing


In [18]:
# import pandas as pd
# from sklearn.preprocessing import LabelEncoder
# # Label Encoding for 'Link Text'
# label_encoder_link_text = LabelEncoder()
# data['Link Text'] = label_encoder_link_text.fit_transform(data['Link Text'])

# # Label Encoding for 'URL'
# label_encoder_url = LabelEncoder()
# data['URL'] = label_encoder_url.fit_transform(data['URL'])

In [19]:
data.head()

Unnamed: 0,Link Text,URL
15,Politics,https://www.nytimes.com/international/section/...
16,New York,https://www.nytimes.com/international/section/...
17,California,https://www.nytimes.com/column/california-today
18,Education,https://www.nytimes.com/international/section/...
19,Health,https://www.nytimes.com/international/section/...


In [20]:
data

Unnamed: 0,Link Text,URL
15,Politics,https://www.nytimes.com/international/section/...
16,New York,https://www.nytimes.com/international/section/...
17,California,https://www.nytimes.com/column/california-today
18,Education,https://www.nytimes.com/international/section/...
19,Health,https://www.nytimes.com/international/section/...
...,...,...
642,Terms of Sale,https://help.nytimes.com/hc/en-us/articles/115...
644,Canada,https://www.nytimes.com/ca/
645,International,https://www.nytimes.com/international/
646,Help,https://help.nytimes.com/hc/en-us


# Model Building

In [21]:
# Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

In [22]:
# Features: the URL strings; labels: the link text, used here as the category name
x = data['URL']
y = data['Link Text']

In [23]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=1)

In [24]:
# Feature extraction using TF-IDF
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
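
Since the features are URL strings, it helps to see what the default `TfidfVectorizer` treats as a token. The sketch below uses `build_analyzer()` with default settings, which lowercases the input and keeps runs of two or more word characters.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# The analyzer is the preprocessing + tokenizing function the vectorizer applies
analyzer = TfidfVectorizer().build_analyzer()

tokens = analyzer("https://www.nytimes.com/international/section/health")
print(tokens)
# ['https', 'www', 'nytimes', 'com', 'international', 'section', 'health']
```

Tokens such as `https`, `www`, and `com` appear in nearly every URL, so they carry little signal; the path segments do most of the work.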

In [25]:
# Choose a model (Random Forest in this case)
model = RandomForestClassifier()
model.fit(X_train_tfidf, y_train)

In [26]:
# Make predictions on the test set
predictions = model.predict(X_test_tfidf)

In [27]:
# # getting accuracy of the project
# accuracy = accuracy_score(y_test,predictions)
# print(accuracy)

In [28]:
# print(classification_report(y_test, predictions))
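
The evaluation cells above are commented out. As a minimal sketch of how such an evaluation could look, the block below trains the same kind of pipeline on a tiny made-up dataset and reports accuracy and a per-class report. The URLs and labels are hypothetical, and `stratify` assumes every class has several examples, which may not hold for the scraped data if most link texts occur only once.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Hypothetical URLs and labels, for illustration only
urls = (
    [f"https://www.example.com/2024/01/{d:02d}/health/story-{d}.html" for d in range(1, 7)]
    + [f"https://www.example.com/2024/01/{d:02d}/sports/story-{d}.html" for d in range(1, 7)]
)
labels = ["health"] * 6 + ["sports"] * 6

# Stratify so both classes appear in the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    urls, labels, test_size=0.5, random_state=1, stratify=labels
)

vectorizer = TfidfVectorizer()
clf = RandomForestClassifier(random_state=1)
clf.fit(vectorizer.fit_transform(X_train), y_train)

predictions = clf.predict(vectorizer.transform(X_test))
accuracy = accuracy_score(y_test, predictions)
print("accuracy:", accuracy)
print(classification_report(y_test, predictions, zero_division=0))
```

When a label appears only in the test set, the model cannot predict it, so the report is more informative than a single accuracy figure.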

# Actual Model (Predictive System)

In [29]:
# Get user input (the model was trained on URLs, so a URL is expected)
user_input = input("Enter the URL: ")

# Feature extraction using the fitted TF-IDF vectorizer
user_input_tfidf = tfidf_vectorizer.transform([user_input])

# Make prediction
prediction = model.predict(user_input_tfidf)

print(prediction)

print("The Category of the news is : ",prediction)

Enter the URL: https://www.nytimes.com/international/section/health
['Food']
The Category of the news is :  ['Food']
