# Task
Can we predict the category (business, entertainment, etc.) of a news article given only its headline?

Can we predict the specific story that a news article refers to, given only its headline?

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


In [2]:
df = pd.read_csv(r'C:\Users\hp\Desktop\BIA main 2\uci-news-aggregator.csv')




In [3]:
df.head()

Unnamed: 0,ID,TITLE,URL,PUBLISHER,CATEGORY,STORY,HOSTNAME,TIMESTAMP
0,1,"Fed official says weak data caused by weather,...",http://www.latimes.com/business/money/la-fi-mo...,Los Angeles Times,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.latimes.com,1394470370698
1,2,Fed's Charles Plosser sees high bar for change...,http://www.livemint.com/Politics/H2EvwJSK2VE6O...,Livemint,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.livemint.com,1394470371207
2,3,US open: Stocks fall after Fed official hints ...,http://www.ifamagazine.com/news/us-open-stocks...,IFA Magazine,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.ifamagazine.com,1394470371550
3,4,"Fed risks falling 'behind the curve', Charles ...",http://www.ifamagazine.com/news/fed-risks-fall...,IFA Magazine,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.ifamagazine.com,1394470371793
4,5,Fed's Plosser: Nasty Weather Has Curbed Job Gr...,http://www.moneynews.com/Economy/federal-reser...,Moneynews,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.moneynews.com,1394470372027


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 422419 entries, 0 to 422418
Data columns (total 8 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   ID         422419 non-null  int64 
 1   TITLE      422419 non-null  object
 2   URL        422419 non-null  object
 3   PUBLISHER  422417 non-null  object
 4   CATEGORY   422419 non-null  object
 5   STORY      422419 non-null  object
 6   HOSTNAME   422419 non-null  object
 7   TIMESTAMP  422419 non-null  int64 
dtypes: int64(2), object(6)
memory usage: 25.8+ MB


In [5]:
len(df['STORY'].unique())

7230

In [6]:
["TITLE" , "URL", "PUBLISHER","CATEGORY","STORY", "HOSTNAME", "TIMESTAMP" ]

['TITLE', 'URL', 'PUBLISHER', 'CATEGORY', 'STORY', 'HOSTNAME', 'TIMESTAMP']

In [7]:
df.drop(['URL','HOSTNAME','TIMESTAMP'],axis=1,inplace=True)

In [8]:
df.head()

Unnamed: 0,ID,TITLE,PUBLISHER,CATEGORY,STORY
0,1,"Fed official says weak data caused by weather,...",Los Angeles Times,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM
1,2,Fed's Charles Plosser sees high bar for change...,Livemint,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM
2,3,US open: Stocks fall after Fed official hints ...,IFA Magazine,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM
3,4,"Fed risks falling 'behind the curve', Charles ...",IFA Magazine,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM
4,5,Fed's Plosser: Nasty Weather Has Curbed Job Gr...,Moneynews,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM


In [9]:
df.duplicated().sum()

0

In [10]:
# Count the null values in each column
df.isnull().sum()

ID           0
TITLE        0
PUBLISHER    2
CATEGORY     0
STORY        0
dtype: int64

In [30]:
df.shape

(422417, 4)

In [31]:
# Drop the rows where PUBLISHER is null
df.dropna(subset=["PUBLISHER"], axis=0, inplace=True)

In [14]:
# Check that no null values remain
df.isnull().sum().to_frame().T

Unnamed: 0,ID,TITLE,PUBLISHER,CATEGORY,STORY
0,0,0,0,0,0


In [15]:
sample_df = df.sample(n=1200)

In [16]:
sample_df.head()

Unnamed: 0,ID,TITLE,PUBLISHER,CATEGORY,STORY
421938,422457,Obama administration picks CEO for HealthCare.gov,Fox News,m,drbHX2-GkgsIZjMovsFKIC2IOEHBM
401776,402295,China Pursues Monopoly Investigation of Microsoft,Wireless Week,t,dbiqwmupWq9ifvMVYDhySlUaxf2iM
139777,140113,Bryan Singer Lawsuit Update: Alleged Victim Mi...,Headlines \& Global News,e,df1xq0HmEv1vT0MdNZ37IHj0mRQPM
174651,174987,Watch Delvin Choice Sing on The Voice 2014 Liv...,Wetpaint,e,d3VULNVO4EviXoMeM4wyN7rMxClfM
371543,372003,AbbVie Ups the Stakes in Its Attempt to Acquir...,Motley Fool,b,dlknCqBVrOVJ4WMTGn5UzGeYobmoM


In [17]:
df = pd.DataFrame(df)

# Set the 'ID' column as the index
df.set_index('ID', inplace=True)

# Print the index of the DataFrame
print(df.index)

Int64Index([     1,      2,      3,      4,      5,      6,      7,      8,
                 9,     10,
            ...
            422928, 422929, 422930, 422931, 422932, 422933, 422934, 422935,
            422936, 422937],
           dtype='int64', name='ID', length=422417)


In [18]:
# Use scikit-learn for the machine learning algorithms.

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression





In [19]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df['TITLE'], df['CATEGORY'])
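
The split above relies on the defaults of `train_test_split`, so the rows that land in each set change on every run. A minimal sketch of a reproducible, class-balanced variant follows, assuming a toy DataFrame with made-up headlines in place of the real `df`; `random_state=42` is an arbitrary choice, not a value taken from this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real DataFrame (hypothetical headlines, same column names).
toy = pd.DataFrame({
    "TITLE": [f"headline {i}" for i in range(20)],
    "CATEGORY": ["b", "e", "m", "t"] * 5,
})

X_train, X_test, y_train, y_test = train_test_split(
    toy["TITLE"],
    toy["CATEGORY"],
    test_size=0.25,            # the same fraction scikit-learn uses when test_size is omitted
    stratify=toy["CATEGORY"],  # keep the category proportions similar in both sets
    random_state=42,           # fixed seed so the split is repeatable
)

print(len(X_train), len(X_test))
print(sorted(y_test.unique()))
```

Stratifying matters most when one category is much smaller than the others, because a purely random split can under-represent it in the test set.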

In [20]:
X_train.head()

ID
277426    UPDATE 4-Emirates cancels 70-plane A350 order ...
208889    Malware More Likely To Affect Windows 7 And Vi...
115898    Concert review: Miley Cyrus dazzles the senses...
34333               Giraffe's 'goodbye' to dying zoo worker
143333    Nike fires majority of FuelBand team as it pre...
Name: TITLE, dtype: object

In [21]:
X_test.head()

ID
296646    'Batman vs Superman' Movie News: Metropolis Ho...
406917    Panasonic says to invest in $5 bln Tesla batte...
63708     Autism Now Affects 1 In 68 Children, A 37-Fold...
217967    Goodbye, Aio: Cricket Wireless Takes Over With...
379043    Roche and Exelixis herald a Phase III victory ...
Name: TITLE, dtype: object

In [22]:
#Extract features from the headlines using a count vectorizer
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)
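
To make the vectorizing step concrete, here is a small sketch on three made-up headlines (they are illustrative, not rows from the dataset). It shows that the vocabulary is learned from the training text only, and that words seen only at test time are silently dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical headlines used only to illustrate the mechanics.
train_headlines = ["Fed sees weak data", "Stocks fall after Fed hints"]
test_headlines = ["Fed data surprises markets"]

vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(train_headlines)  # learns the vocabulary
X_test_vec = vectorizer.transform(test_headlines)        # reuses that vocabulary

vocab = sorted(vectorizer.vocabulary_)  # column order is alphabetical
print(vocab)
print(X_train_vec.shape, X_test_vec.shape)
print(X_test_vec.toarray())
```

In the test row only the columns for `data` and `fed` are non-zero; `surprises` and `markets` were never seen during fitting, so they have no column.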

In [23]:
print(X_train_vec, X_test_vec)

  (0, 45702)	1
  (0, 15822)	1
  (0, 8953)	1
  (0, 2699)	1
  (0, 33242)	1
  (0, 3125)	1
  (0, 31374)	1
  (0, 22807)	1
  (0, 7442)	1
  (0, 43848)	1
  (0, 3987)	1
  (0, 37336)	1
  (1, 43848)	1
  (1, 27124)	1
  (1, 29112)	1
  (1, 26173)	1
  (1, 3774)	1
  (1, 47631)	2
  (1, 4542)	1
  (1, 46551)	1
  (1, 43365)	1
  (1, 48188)	1
  (2, 11157)	1
  (2, 36827)	1
  (2, 28410)	1
  :	:
  (316809, 19284)	1
  (316810, 22807)	1
  (316810, 47725)	1
  (316810, 41313)	1
  (316810, 31530)	1
  (316810, 15444)	1
  (316810, 40621)	1
  (316810, 17367)	1
  (316810, 42329)	1
  (316810, 39172)	1
  (316810, 9736)	1
  (316810, 48157)	1
  (316810, 7155)	1
  (316811, 22807)	1
  (316811, 43848)	1
  (316811, 43385)	1
  (316811, 42785)	1
  (316811, 17726)	1
  (316811, 26351)	1
  (316811, 40426)	1
  (316811, 26340)	1
  (316811, 3485)	1
  (316811, 14036)	1
  (316811, 11654)	1
  (316811, 28097)	1   (0, 6380)	1
  (0, 19523)	1
  (0, 21833)	1
  (0, 28205)	1
  (0, 29275)	1
  (0, 30133)	1
  (0, 42229)	2
  (0, 46707)	1
  (0, 4772

In [24]:
# Train a logistic regression model

model = LogisticRegression()
model.fit(X_train_vec, y_train)

# Evaluate the model on the training set (test-set metrics are computed in the cells that follow)
score = model.score(X_train_vec, y_train)
print('Accuracy:', score)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Accuracy: 0.9652633107331793
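
The warning above says the solver stopped before converging, and the reported accuracy is measured on the same rows the model was trained on. A minimal sketch of one way to address both points follows, assuming made-up toy headlines; `max_iter=1000` is an illustrative value, not a tuned one, and may still need raising on the full dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy headlines; 'b' = business, 'e' = entertainment.
train_titles = [
    "stocks fall as fed raises rates",
    "bank profits rise on strong earnings",
    "fed official sees weak jobs data",
    "new movie tops weekend box office",
    "singer announces world tour dates",
    "actor wins award for new movie",
]
train_labels = ["b", "b", "b", "e", "e", "e"]
test_titles = ["fed rates decision lifts bank stocks", "movie award goes to singer"]
test_labels = ["b", "e"]

vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(train_titles)
X_test_vec = vectorizer.transform(test_titles)

# A larger max_iter gives the solver more iterations before it gives up.
model = LogisticRegression(max_iter=1000)
model.fit(X_train_vec, train_labels)

# Report both numbers: the training score alone tends to be optimistic.
train_acc = model.score(X_train_vec, train_labels)
test_acc = model.score(X_test_vec, test_labels)
print("Training accuracy:", train_acc)
print("Test accuracy:", test_acc)
```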


In [25]:
# Use the trained model to predict categories for the held-out test headlines
y_pred = model.predict(X_test_vec)

In [26]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

# y_test holds the true labels and y_pred the predicted labels
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='macro')
recall = recall_score(y_test, y_pred, average='macro')

# to get a confusion matrix
confusion_mat = confusion_matrix(y_test, y_pred)


In [27]:
print(accuracy)
print(precision)
print(recall)
print(confusion_mat)

0.9470195539983902
0.9460821755701057
0.9421174682342748
[[27049   383   224  1361]
 [  460 37284   141   282]
 [  395   222 10520   145]
 [ 1478   384   120 25157]]
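
As a cross-check, the three scores printed above can be recomputed by hand from the confusion matrix. The sketch assumes scikit-learn's usual layout (rows are true labels, columns are predicted labels) and that the rows follow the sorted label order `b`, `e`, `m`, `t`; the label order is an assumption, since the notebook does not print it.

```python
# Confusion matrix copied from the output above.
confusion = [
    [27049, 383, 224, 1361],
    [460, 37284, 141, 282],
    [395, 222, 10520, 145],
    [1478, 384, 120, 25157],
]
labels = ["b", "e", "m", "t"]  # assumed sorted label order
n = len(labels)

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(n))
accuracy = correct / total

row_sums = [sum(row) for row in confusion]        # true count per class
col_sums = [sum(col) for col in zip(*confusion)]  # predicted count per class

recalls = [confusion[i][i] / row_sums[i] for i in range(n)]
precisions = [confusion[i][i] / col_sums[i] for i in range(n)]

# Macro averaging: unweighted mean over the classes.
macro_precision = sum(precisions) / n
macro_recall = sum(recalls) / n

print(total, correct)
print(round(accuracy, 4), round(macro_precision, 4), round(macro_recall, 4))
```

The largest off-diagonal cells are the pair between the first and last classes (1361 and 1478), so under the assumed label order most errors are confusions between `b` and `t`.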


In [None]:
# Group the articles by story ID
story_groups = df.groupby('STORY')

# Print the number of stories and the number of articles per story

print('Number of stories:', len(story_groups))
print('Number of articles per story:')
print(story_groups.size())

# Can we predict the specific story that a news article refers to, given only its headline?

# Justification of story prediction


If you have only the story IDs and not the complete story texts, it may be more difficult to predict the categories of the stories based on their headlines alone. However, it may still be possible to perform some analysis using the available data.

One approach you could take is to try to extract some features from the headlines that may be indicative of the story category. For example, you could try using natural language processing techniques to extract named entities (such as people, organizations, and locations) from the headlines, or you could use sentiment analysis to determine the overall sentiment of the headline.

You could also try to use the available story IDs to link the headlines to other sources of information, such as external databases or websites, that may provide additional context about the stories and their categories.

However, it's important to keep in mind that the accuracy of any predictions based on incomplete or limited data may be lower than if you had access to the complete story texts. It's also possible that some categories may be more difficult to predict based on the available data than others. Therefore, it's important to evaluate the performance of any analysis or predictions using appropriate metrics and to interpret the results with caution.
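
One way to start on the story question with headlines alone is to treat each `STORY` ID as a class label and reuse the same text-classification recipe. The following is only a sketch under stated assumptions: the rows and the short story IDs (`fed`, `a350`, `giraffe`) are made up, TF-IDF is swapped in for the raw counts used earlier, and the threshold `min_articles` is a hypothetical parameter. With thousands of distinct stories in the real data, training would be much heavier and accuracy is likely to be lower than in the category task.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical toy rows; the STORY values are placeholders, not real story IDs.
toy = pd.DataFrame({
    "TITLE": [
        "Fed official says weak data caused by weather",
        "Fed's Plosser sees high bar for change in taper pace",
        "Fed risks falling behind the curve, Plosser says",
        "Plosser: nasty weather has curbed job growth, Fed on track",
        "Emirates cancels 70-plane A350 order",
        "Airbus shares drop after Emirates scraps A350 order",
        "Emirates A350 cancellation a blow to Airbus",
        "Airbus loses Emirates order for 70 A350 jets",
        "Giraffe says goodbye to dying zoo worker",
    ],
    "STORY": ["fed"] * 4 + ["a350"] * 4 + ["giraffe"],
})

# Keep only stories with enough headlines to appear in both train and test.
min_articles = 2  # hypothetical threshold
counts = toy["STORY"].value_counts()
kept_stories = sorted(counts[counts >= min_articles].index)
subset = toy[toy["STORY"].isin(kept_stories)]

X_train, X_test, y_train, y_test = train_test_split(
    subset["TITLE"], subset["STORY"],
    test_size=0.25, stratify=subset["STORY"], random_state=0,
)

vectorizer = TfidfVectorizer()
model = LogisticRegression(max_iter=1000)
model.fit(vectorizer.fit_transform(X_train), y_train)

accuracy = model.score(vectorizer.transform(X_test), y_test)
print("Stories kept:", kept_stories)
print("Held-out accuracy:", accuracy)
```

Filtering out stories with a single headline is a design choice: such a story cannot be in both the training and test sets, so it can only add noise to the evaluation.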
