### Building a text classifier for news articles

In this notebook, I will:
1. Scrape data from a website called Inshorts
2. Scrape data for 3 news categories: sports, world, technology
3. Scrape the news headline and news article
4. Build a text classifier to classify the news articles

Disclosure: the web scraping code is based on https://towardsdatascience.com/a-practitioners-guide-to-natural-language-processing-part-i-processing-understanding-text-9f4abfd13e72

In [1]:
#Import necessary libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

1. requests fetches the HTML content of each landing page
2. BeautifulSoup parses the HTML and extracts the headline and article for all 3 categories (sports, technology and world affairs)
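To make the parsing step concrete, here is a minimal sketch that runs BeautifulSoup on an inline HTML string instead of a live page. The HTML fragment is hypothetical: it only imitates the class names and `itemprop` attributes the scraper looks for, and the real Inshorts markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment that imitates the structure the scraper expects;
# the real Inshorts markup may differ.
sample_html = """
<div class="news-card-title news-right-box">
  <span itemprop="headline">Team wins the final</span>
</div>
<div class="news-card-content news-right-box">
  <div itemprop="articleBody">The team won the final by two goals.</div>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# class_ with a single class name matches any element that carries that class
title_box = soup.find("div", class_="news-card-title")
content_box = soup.find("div", class_="news-card-content")

# Look up the inner elements by their itemprop attribute and take their text
headline = title_box.find("span", attrs={"itemprop": "headline"}).get_text(strip=True)
article = content_box.find("div", attrs={"itemprop": "articleBody"}).get_text(strip=True)

print(headline)
print(article)
```

`get_text(strip=True)` is used here instead of `.string` because `.string` returns `None` when an element contains nested tags.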

In [2]:
#urls to the news articles
seed_urls = ['https://www.inshorts.com/en/read/sports',
            'https://www.inshorts.com/en/read/technology',
            'https://www.inshorts.com/en/read/world']

In [3]:
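#Function for scraping the data from the given urls
def build_datasets(seed_urls):
    """Scrape headline, article text and category from each seed url into a DataFrame."""
    news_data = []

    for url in seed_urls:
        #The last path segment of the url is the category name
        news_category = url.split('/')[-1]

        #Fetch the page and fail early on HTTP errors instead of parsing an error page
        data = requests.get(url, timeout=10)
        data.raise_for_status()
        soup = BeautifulSoup(data.content, 'html.parser')

        #Pair each title box with its content box and pull out the text
        news_articles = [{'news_headline': headline.find('span',
                                                         attrs={"itemprop": "headline"}).string,
                          'news_article': article.find('div',
                                                       attrs={"itemprop": "articleBody"}).string,
                          'news_category': news_category}

                            for headline, article in
                             zip(soup.find_all('div',
                                               class_=["news-card-title news-right-box"]),
                                 soup.find_all('div',
                                               class_=["news-card-content news-right-box"]))
                        ]
        news_data.extend(news_articles)

    #Fix the column order of the resulting DataFrame
    df = pd.DataFrame(news_data)
    df = df[['news_headline', 'news_article', 'news_category']]

    return df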

In [4]:
news_df = build_datasets(seed_urls)

In [5]:
#Data has 73 news articles from 3 categories
news_df.shape

(73, 3)

In [6]:
#Distribution of news articles over 3 categories
news_df.news_category.value_counts()

sports        25
technology    24
world         24
Name: news_category, dtype: int64

### Train and Test Split

In [7]:
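#Combining news headline and news article into a new column 'news'
#(the space separator keeps the last word of the headline from fusing with the first word of the article)
news_df['news'] = news_df['news_headline'] + ' ' + news_df['news_article']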

In [8]:
#Splitting the data into 75% train and 25% test
from sklearn.model_selection import train_test_split
X = news_df[['news']]
y = news_df['news_category']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.25, random_state=42)

In [9]:
#Each category has an equal number of news articles in the training set
y_train.value_counts()

sports        18
world         18
technology    18
Name: news_category, dtype: int64
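The effect of `stratify` can be checked without scraping anything. The sketch below uses placeholder documents with the same class counts as the scraped data (25/24/24); the texts are hypothetical and only the labels matter here.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder data with the same class counts as the scraped articles (25/24/24)
labels = pd.Series(["sports"] * 25 + ["technology"] * 24 + ["world"] * 24)
texts = pd.Series([f"doc {i}" for i in range(len(labels))])

# stratify=labels keeps the class proportions as equal as possible in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, stratify=labels, test_size=0.25, random_state=42
)

train_counts = y_tr.value_counts().to_dict()
test_counts = y_te.value_counts().to_dict()
print(train_counts)
print(test_counts)
```

With 73 samples and `test_size=0.25`, the test set is rounded up to 19 samples and the remaining 54 go to training.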

### Building a pipeline that first vectorizes the text with TF-IDF and then classifies it with RidgeClassifier
#### 1. The max_df and min_df parameters are tuned for TF-IDF.
#### 2. alpha is tuned for the ridge classifier.
#### 3. StratifiedKFold is used to preserve the percentage of samples of each class in every fold.
#### 4. ngram_range for TF-IDF is set to (1,2), so both unigrams and bigrams are used.
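As a rough illustration of what these TF-IDF settings do, the sketch below fits two vectorizers on a hypothetical four-document corpus: one with only `ngram_range=(1, 2)`, and one that also filters terms by document frequency. The corpus and the threshold values are assumptions chosen for the example, not the values selected by the grid search.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical four-document corpus, used only for illustration
corpus = [
    "the team won the match",
    "the team lost the match",
    "the new phone has a fast chip",
    "the minister met the president",
]

# Unigrams and bigrams, no document-frequency filtering
full = TfidfVectorizer(ngram_range=(1, 2))
full.fit(corpus)

# Drop terms that occur in more than 60% of the documents (float max_df)
# or in fewer than 2 documents (integer min_df counts documents)
pruned = TfidfVectorizer(ngram_range=(1, 2), max_df=0.6, min_df=2)
pruned.fit(corpus)

full_terms = sorted(full.vocabulary_)
pruned_terms = sorted(pruned.vocabulary_)
print(len(full_terms))
print(pruned_terms)
```

In the pruned vocabulary, "the" disappears because it occurs in every document, and terms seen in a single document are dropped, so only terms shared by exactly two documents remain.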

In [10]:
from sklearn.linear_model import RidgeClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

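#Stratified 5-fold cross-validation for the grid search
cv = StratifiedKFold(n_splits=5)

pipeline = Pipeline([('tfidf', TfidfVectorizer(ngram_range=(1,2))),
                     ('clf', RidgeClassifier())])

#Note: recent scikit-learn releases may reject an integer min_df of 0; 0.0 is the equivalent float
param_grid = {'tfidf__max_df': [.6,.7,.8,.9],
              'tfidf__min_df': [0,.1,.2],
              'clf__alpha': [.001,.01,.1,1,10,100,1000]}

#Pass the StratifiedKFold splitter defined above (an integer cv=5 would create an equivalent one for a classifier)
grid_search = GridSearchCV(estimator=pipeline, cv=cv, param_grid=param_grid)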

grid_search.fit(X_train.news, y_train)
y_pred = grid_search.predict(X_test.news)

print(classification_report(y_pred=y_pred, y_true = y_test))

              precision    recall  f1-score   support

      sports       0.86      0.86      0.86         7
  technology       1.00      0.67      0.80         6
       world       0.75      1.00      0.86         6

   micro avg       0.84      0.84      0.84        19
   macro avg       0.87      0.84      0.84        19
weighted avg       0.87      0.84      0.84        19
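To see where figures such as a precision of 1.00 with a recall of 0.67 come from, the sketch below recomputes the per-class metrics from a confusion matrix. The counts are an assumption: they are inferred from the rounded report and the supports (7, 6, 6), not taken from the notebook's actual predictions.

```python
import numpy as np

labels = ["sports", "technology", "world"]

# Hypothetical confusion matrix (rows = true class, columns = predicted class).
# The counts are inferred from the rounded report; the notebook's actual
# predictions are not shown.
cm = np.array([
    [6, 0, 1],
    [1, 4, 1],
    [0, 0, 6],
])

# Precision: correct predictions divided by everything predicted as that class
precision = cm.diagonal() / cm.sum(axis=0)
# Recall: correct predictions divided by everything that truly is that class
recall = cm.diagonal() / cm.sum(axis=1)
# Accuracy: all correct predictions divided by all samples
accuracy = cm.diagonal().sum() / cm.sum()

for name, p, r in zip(labels, precision, recall):
    print(f"{name:<12} precision={p:.2f} recall={r:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Under this assumption, every article predicted as technology really is one (precision 1.00), but two of the six technology articles are assigned to other classes (recall 0.67).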

In [11]:
#Best parameters found by the grid search
grid_search.best_params_

{'clf__alpha': 0.001, 'tfidf__max_df': 0.7, 'tfidf__min_df': 0}
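`best_params_` reports only the winning combination. The cross-validated score behind it, and the scores of the other candidates, are available through `best_score_` and `cv_results_`. The sketch below shows this on a hypothetical toy corpus with a smaller grid (only `clf__alpha`), so it runs without the scraped data; the texts and grid values are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 6 documents per class
texts = [
    "team wins match", "striker scores goal", "coach praises team",
    "match ends in draw", "goal decides final", "team signs striker",
    "phone gets new chip", "app adds new feature", "chip powers laptop",
    "laptop runs new app", "phone maker updates app", "feature uses chip",
]
labels = ["sports"] * 6 + ["technology"] * 6

toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", RidgeClassifier()),
])

# Small grid over alpha only, evaluated with stratified 3-fold cross-validation
toy_search = GridSearchCV(
    estimator=toy_pipeline,
    param_grid={"clf__alpha": [0.01, 1, 100]},
    cv=StratifiedKFold(n_splits=3),
)
toy_search.fit(texts, labels)

# Winning combination and its mean cross-validated accuracy
print(toy_search.best_params_)
print(round(toy_search.best_score_, 2))

# Mean cross-validated accuracy of every candidate in the grid
for params, score in zip(toy_search.cv_results_["params"],
                         toy_search.cv_results_["mean_test_score"]):
    print(params, round(score, 2))
```

Comparing `best_score_` with the test-set report is a quick way to spot a grid search that has overfit the cross-validation folds, which is a real risk with only 54 training articles.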