###### Name: Abd-el-rahman Zaglool

###### Project: News classification

For the project, I'll collect the news from The New York Times webpage. Articles will be retrieved from 8 different sections:
1. Business
2. Science
3. Health
4. Sports
5. Arts
6. Style 
7. Food
8. Travel

In [2]:
# importing the necessary packages

import requests
import time
import pandas as pd

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import accuracy_score

The New York Times offers a wide range of handy APIs. One of them is the 'Times Wire API', which provides a real-time feed of the articles published by the NYT.

To fetch articles from the 'Times Wire API' I'll create a helper function named scrape_articles(). 

The function takes as inputs the desired news section and the number of articles to retrieve (the 'limit' parameter).
 
The function returns the response as a JSON object.

In [17]:
def scrape_articles(section, limit):
  # Build the Times Wire URL for the requested section and number of articles
  requestUrl = 'https://api.nytimes.com/svc/news/v3/content/all/'+section+'.json?limit='+str(limit)+'&api-key=X5IOvJkmFhSRsE7jrZHfMM25hEATDJpS'
  requestHeaders = {
    "Accept": "application/json"
  }

  response = requests.get(requestUrl, headers=requestHeaders).json()

  return response
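
One possible variation, sketched below, keeps the API key out of the notebook by reading it from an environment variable and lets `urllib.parse.urlencode` build the query string. The function name `build_wire_url` and the variable name `NYT_API_KEY` are assumptions made for this sketch; they are not part of the original code or of the NYT API.

```python
import os
from urllib.parse import urlencode

BASE_URL = 'https://api.nytimes.com/svc/news/v3/content/all/'

def build_wire_url(section, limit, api_key=None):
    """Build the Times Wire URL used by scrape_articles() (hypothetical helper)."""
    # NYT_API_KEY is an assumed environment variable name, not an NYT convention
    if api_key is None:
        api_key = os.environ.get('NYT_API_KEY', '')
    query = urlencode({'limit': limit, 'api-key': api_key})
    return f'{BASE_URL}{section}.json?{query}'

url = build_wire_url('science', 20, api_key='YOUR_KEY')
print(url)
```

The resulting string can be passed to `requests.get()` exactly as in `scrape_articles()`.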

I'll also define a second helper function in order to extract the necessary fields ('title', 'abstract' and 'section' of the article) from the response.

In [18]:
def retrieve_articles(response):
        articles = []
        docs = response['results']
        for doc in docs:
                filteredDoc = {}
                filteredDoc['title'] = doc['title']
                filteredDoc['abstract'] = doc['abstract']
                filteredDoc['section'] = doc['section']
                articles.append(filteredDoc)

        return articles        
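
To make the shape of the data concrete, here is a small sketch that runs the same extraction logic on a hand-written response. The `mock_response` dictionary is invented for illustration (only the keys used above are assumed; a real response carries many more fields), and the function is repeated so that the block runs on its own.

```python
def retrieve_articles(response):
    """Keep only the 'title', 'abstract' and 'section' of each result."""
    articles = []
    for doc in response['results']:
        articles.append({
            'title': doc['title'],
            'abstract': doc['abstract'],
            'section': doc['section'],
        })
    return articles

# Invented example payload: the extra 'url' key is dropped by the helper
mock_response = {
    'results': [
        {'title': 'Example title', 'abstract': 'Example abstract.',
         'section': 'Science', 'url': 'https://example.com/a'},
    ]
}

articles = retrieve_articles(mock_response)
print(articles)
```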

Now I'll use the two functions defined above to repeat article retrieval and field extraction for each of the 8 sections.

The result will be a list of 8 lists, where each inner list contains articles from one section only.

The articles are stored as dictionaries with three keys: 'title', 'abstract' and 'section'.

From each section I've collected at most 500 articles. The number of articles collected may vary across sections, depending on the feed at the time the code is executed.


In [41]:
sections = ['business', 'science', 'health', 'sports', 'arts', 'style', 'food', 'travel']
limit = 500
news = []

for section in sections:

    response = scrape_articles(section, limit)
    section_articles = retrieve_articles(response)
    news.append(section_articles)
    time.sleep(12)  # pause between requests to avoid hitting the API rate limit

# time elapsed 1m 57s    

In [43]:
# Storing the retrieved news in a DataFrame 

df = pd.DataFrame()

# Iterate over each list of dictionaries
for sublist in news:
    # Create a temporary DataFrame for each list of dictionaries
    temp_df = pd.DataFrame(sublist)
    
    # Concatenate the temporary DataFrame with the main DataFrame
    df = pd.concat([df, temp_df], ignore_index=True)

df.to_excel('news.xlsx')    

In [23]:
df

Unnamed: 0,title,abstract,section
0,"As U.S. and Chinese Officials Meet, Businesses...",Chief executives in the U.S. have long pushed ...,Business
1,Investor and His Hedge Fund Are Rocked by Sex ...,"After accusations were published in Britain, C...",Business
2,3 Men Charged in Case That Spotlights Attacks ...,The homes of New Hampshire Public Radio journa...,Business
3,Two Former Tucker Carlson Producers Exit Fox News,The departures are the latest fallout since th...,Business
4,Twitch Star Signs $100 Million Deal With Rival...,"The deal signed by Félix Lengyel, known as xQc...",Business
...,...,...,...
3995,Beneath a Blanket of Stars,It isn’t as easy as it once was to find a dazz...,Travel
3996,Six Days Afloat in the Everglades,After a storm disrupted plans for a 99-mile pa...,Travel
3997,"‘I Identify as an Angler’: Meet Erica Nelson, ...","She hooks tree branches, slips on rocks, and s...",Travel
3998,Will You ‘Go Big’ in Travel This Year?,We want to know if you’ll be packing your bags...,Travel


In [28]:
# visualizing number of articles retrieved by section
df.section.value_counts()

section
Business    500
Science     500
Health      500
Sports      500
Arts        500
Style       500
Food        500
Travel      500
Name: count, dtype: int64

In [35]:
# dropping rows with NaN on both title and abstract:
df = df.drop(index= df[(df['title'].isna()) & df['abstract'].isna()].index)
df.reset_index(drop=True, inplace=True)

In [36]:
# Concatenating title and abstract into a single 'article' column
# Some rows have NA in title, others have NA in abstract, so the different cases are handled separately

both = df['title'].notna() & df['abstract'].notna()  # no NAs
no_title = df['title'].isna()                        # NA in title

# Rows with NA in abstract simply keep their title
df.loc[both, 'title'] = df.loc[both, 'title'] + ' : ' + df.loc[both, 'abstract']
df.loc[no_title, 'title'] = df.loc[no_title, 'abstract']
df.rename(columns={'title' : 'article'}, inplace=True)
df.drop(columns='abstract', inplace=True)
df

Unnamed: 0,article,section
0,"As U.S. and Chinese Officials Meet, Businesses...",Business
1,Investor and His Hedge Fund Are Rocked by Sex ...,Business
2,3 Men Charged in Case That Spotlights Attacks ...,Business
3,Two Former Tucker Carlson Producers Exit Fox N...,Business
4,Twitch Star Signs $100 Million Deal With Rival...,Business
...,...,...
3987,Beneath a Blanket of Stars : It isn’t as easy ...,Travel
3988,Six Days Afloat in the Everglades : After a st...,Travel
3989,"‘I Identify as an Angler’: Meet Erica Nelson, ...",Travel
3990,Will You ‘Go Big’ in Travel This Year? : We wa...,Travel
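
The three cases handled above (missing abstract, missing title, both present) can be summarised with a plain-Python sketch. The helper `merge_fields` is hypothetical and uses `None` to stand in for the NaN values found in the DataFrame; the strings are made up.

```python
def merge_fields(title, abstract, sep=' : '):
    """Join title and abstract, falling back to whichever one is present."""
    if title is None:
        return abstract            # NA in title: keep the abstract only
    if abstract is None:
        return title               # NA in abstract: keep the title only
    return title + sep + abstract  # no NAs: concatenate both

examples = [
    merge_fields('Some title', 'Some abstract.'),
    merge_fields('Some title', None),
    merge_fields(None, 'Some abstract.'),
]
print(examples)
```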


Now I'll set up the necessary components for text preprocessing tasks. 

Preprocessing involves text tokenization, stop-word removal, part-of-speech mapping and lemmatization.

In [71]:
stop_words = set(stopwords.words('english')) #retrieving english stop words list

tokenizer = RegexpTokenizer(r'\w+') #initializing tokenizer

def get_wordnet_pos(word):
    """Maps the input word's POS tag to the appropriate WordNet POS tag"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

lemmatizer = WordNetLemmatizer() #initializing lemmatizer

In [72]:
# Preprocessing each article of the news DataFrame

corpus = []

for article in df['article']:
    word_tokens = tokenizer.tokenize(article)  # tokenization by word

    # keep only the words that aren't present in stop_words, in lowercase
    filtered_sentence = [w.lower() for w in word_tokens if w.lower() not in stop_words]

    # lemmatize each remaining word using its mapped WordNet POS tag
    filtered_sentence_lemmatized = [lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in filtered_sentence]
    corpus.append(' '.join(filtered_sentence_lemmatized))

# time elapsed: 5m 14s
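
To see what the first two steps do to a single headline, here is a dependency-free sketch. It assumes that `re.findall(r'\w+', ...)` yields the same tokens as `RegexpTokenizer(r'\w+')`, and it uses a three-word stand-in for NLTK's English stop-word list; lemmatization is left out because it needs the WordNet data.

```python
import re

# Tiny stand-in for stopwords.words('english'); the real list is much longer
toy_stop_words = {'as', 'and', 's'}

def tokenize_and_filter(text, stop_words):
    """Split on runs of word characters, lowercase, and drop stop words."""
    tokens = re.findall(r'\w+', text)
    return [w.lower() for w in tokens if w.lower() not in stop_words]

headline = 'As U.S. and Chinese Officials Meet'
filtered = tokenize_and_filter(headline, toy_stop_words)
print(filtered)  # ['u', 'chinese', 'officials', 'meet']
```

In this sketch the regular expression splits 'U.S.' on the periods and `s` is treated as a stop word, which is consistent with the lone `u` that appears at the start of the first processed article.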

I'll then save the processed articles as a new column in my DataFrame.

In addition, I'll factorize the sections (i.e. assign a different number to each section). The 'factorized_section' column will be my target variable in the news classification problem.

In [25]:
df['processed_article'] = corpus
df['factorized_section'] = df['section'].factorize()[0]
df.dropna(inplace=True)
df

Unnamed: 0,article,section,processed_article,factorized_section
0,"As U.S. and Chinese Officials Meet, Businesses...",Business,u chinese official meet business temper hope c...,0
1,Investor and His Hedge Fund Are Rocked by Sex ...,Business,investor hedge fund rock sex assault allegatio...,0
2,3 Men Charged in Case That Spotlights Attacks ...,Business,3 men charge case spotlight attack medium home...,0
3,Two Former Tucker Carlson Producers Exit Fox N...,Business,two former tucker carlson producer exit fox ne...,0
4,Twitch Star Signs $100 Million Deal With Rival...,Business,twitch star sign 100 million deal rival platfo...,0
...,...,...,...,...
3986,Beneath a Blanket of Stars : It isn’t as easy ...,Travel,beneath blanket star easy find dazzle night sk...,7
3987,Six Days Afloat in the Everglades : After a st...,Travel,six day afloat everglades storm disrupt plan 9...,7
3988,"‘I Identify as an Angler’: Meet Erica Nelson, ...",Travel,identify angler meet erica nelson female indig...,7
3989,Will You ‘Go Big’ in Travel This Year? : We wa...,Travel,go big travel year want know pack bag 2022,7


In [46]:
# Show the factorization legend
df[['section', 'factorized_section']].drop_duplicates().sort_values('factorized_section')

Unnamed: 0,section,factorized_section
0,Business,0
500,Science,1
998,Health,2
1498,Sports,3
1992,Arts,4
2492,Style,5
2992,Food,6
3492,Travel,7
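
As a minimal illustration of how `factorize()` assigns the codes, the sketch below runs it on a short, made-up list of sections. Codes are handed out in order of first appearance, which is why 'Business' maps to 0 in the legend above.

```python
import pandas as pd

toy_sections = pd.Series(['Business', 'Science', 'Business', 'Travel'])

# factorize() returns the integer codes and the unique values they refer to
codes, uniques = toy_sections.factorize()

print(codes.tolist())  # [0, 1, 0, 2]
print(list(uniques))   # ['Business', 'Science', 'Travel']
```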


To be fed to the classification model, the processed articles must be converted into a numerical representation. In particular, with the help of scikit-learn's CountVectorizer class, I'll represent the articles as a sparse matrix with
- n rows = number of processed articles
- n columns = number of unique words in the corpus

In [4]:
x = df.iloc[:,2].values
y = df.factorized_section.values

# Init CountVectorizer
cv = CountVectorizer()
# Learn the vocabulary from the corpus and build the count matrix (converted to a dense array)
x = cv.fit_transform(x).toarray()

At this point I only need to split the data into two portions (train and test).

In [5]:
# Train portion is 80% of the data, test portion is the remaining 20%
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 42, shuffle = True)
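
As a quick sanity check of the proportions, the same call can be tried on ten made-up samples. The `toy_` and `t`-prefixed names are used only in this sketch.

```python
from sklearn.model_selection import train_test_split

toy_x = [[i] for i in range(10)]  # 10 made-up samples with one feature each
toy_y = [0, 1] * 5

tx_train, tx_test, ty_train, ty_test = train_test_split(
    toy_x, toy_y, test_size=0.2, random_state=42, shuffle=True)

print(len(tx_train), len(tx_test))  # 8 2
```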

Now I'm ready to call the classification model.

To classify the articles I chose a basic Logistic Regression since it offers a good trade-off between performance and complexity (i.e. computational expense). 

Given that the news classification problem involves more than two classes, I'll use the OneVsRestClassifier() class. OneVsRest takes a binary classifier (here, the Logistic Regression) and wraps it to handle multiclass classification by decomposing the problem into multiple binary classification tasks, one per class.

To evaluate the predictive power of the model, I relied on the accuracy score, since all sections are equally important.

In [21]:
mdl = LogisticRegression(random_state=42)

oneVsRest = OneVsRestClassifier(mdl)

oneVsRest.fit(x_train, y_train)

y_pred = oneVsRest.predict(x_test)

# Performance
accuracy = round(accuracy_score(y_test, y_pred)*100, 2)
print(f'Accuracy Score of Basic Logistic Regression: {accuracy} %')

Accuracy Score of Basic Logistic Regression: 79.6 %
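
The one-binary-classifier-per-class decomposition can be checked on a toy problem. The data below is made up, and the sketch only inspects how many binary models were fitted; it makes no claim about accuracy.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Made-up word-count rows for three classes (0, 1, 2)
toy_x = [[3, 0, 0], [2, 0, 1], [0, 3, 0], [0, 2, 1], [0, 0, 3], [1, 0, 2]]
toy_y = [0, 0, 1, 1, 2, 2]

toy_ovr = OneVsRestClassifier(LogisticRegression(random_state=42))
toy_ovr.fit(toy_x, toy_y)

n_binary_models = len(toy_ovr.estimators_)  # one fitted LogisticRegression per class
classes = toy_ovr.classes_.tolist()

print(n_binary_models, classes)
```

With the 8 sections of the news dataset, the same wrapper would fit 8 binary models, each separating one section from all the others.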


The accuracy score measures the proportion of correctly predicted labels out of the total number of samples. On average, our Logistic Regression correctly predicts the section of about 8 articles out of 10! That's already a good result, which could be further improved by, for example, tuning the model's hyperparameters...
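
For reference, the accuracy computation itself is just a ratio, as this plain-Python sketch shows. The labels are invented and `manual_accuracy` is a hypothetical helper, intended to mirror what `accuracy_score` computes for single-label predictions.

```python
def manual_accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

toy_true = [0, 1, 2, 3, 4, 5, 6, 7, 0, 1]
toy_pred = [0, 1, 2, 3, 4, 5, 6, 0, 0, 2]  # two mistakes out of ten

toy_accuracy = manual_accuracy(toy_true, toy_pred)
print(toy_accuracy)  # 0.8
```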