In this notebook we download the [News Category Dataset](https://www.kaggle.com/datasets/rmisra/news-category-dataset) from Kaggle and prepare it for a small experiment in model fine-tuning, validation, and testing.

Install all necessary packages

In [1]:
!pip install sentence-transformers



Import all necessary libraries

In [2]:
import os
import pandas as pd
import json
from sklearn.model_selection import train_test_split
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import numpy as np
import plotly.graph_objs as go

Utility function to download the dataset from Kaggle. On Google Colab, it prompts you to upload your *kaggle.json* file, which contains your Kaggle API credentials, through the upload widget. Elsewhere, it uses the *kaggle* package, which expects *kaggle.json* to already be in place (by default under *~/.kaggle/*).

In [3]:
def download_kaggle_dataset():
    if 'google.colab' in str(get_ipython()):
      from google.colab import files
      import zipfile

      uploaded = files.upload()
      os.makedirs('/root/.kaggle', exist_ok=True)
      os.rename('kaggle.json', '/root/.kaggle/kaggle.json')
      # File modes are octal: 0o600 means read/write for the owner only
      os.chmod('/root/.kaggle/kaggle.json', 0o600)

      !kaggle datasets download -d rmisra/news-category-dataset

      with zipfile.ZipFile('/content/news-category-dataset.zip', 'r') as zip_ref:
        zip_ref.extractall('/content/news-category-dataset')

    else:
      import kaggle
      kaggle.api.dataset_download_files('rmisra/news-category-dataset', path='.', unzip=True)

    print('Dataset downloaded and extracted.')

In [4]:
download_kaggle_dataset()

Saving kaggle.json to kaggle.json
news-category-dataset.zip: Skipping, found more recently modified local copy (use --force to force download)
Dataset downloaded and extracted.
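
Before calling the download function, it can help to check the credentials file. The sketch below is an illustration only: `check_kaggle_credentials` is a hypothetical helper, not part of the *kaggle* package. It assumes the token file is a JSON object with `username` and `key` entries and that it should be readable by its owner only.

```python
import json
import os
import stat


def check_kaggle_credentials(path):
    """Return a list of problems found with a kaggle.json file (empty if none).

    Hypothetical helper for illustration; not part of the kaggle package.
    """
    if not os.path.isfile(path):
        return [f'{path} does not exist']

    problems = []
    with open(path, 'r') as f:
        creds = json.load(f)

    # The token file is expected to hold a "username" and a "key" entry
    for field in ('username', 'key'):
        if not creds.get(field):
            problems.append(f'missing or empty field: {field}')

    # Flag files that group members or other users can access
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & 0o077:
        problems.append(f'permissions {oct(mode)} are too open, expected 0o600')

    return problems
```

For example, `check_kaggle_credentials(os.path.expanduser('~/.kaggle/kaggle.json'))` returns an empty list when the file looks usable.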


Read the downloaded dataset into a Pandas dataframe

In [5]:
if 'google.colab' in str(get_ipython()):
  df_path = '/content/news-category-dataset/News_Category_Dataset_v3.json'
else:
  df_path = './News_Category_Dataset_v3.json'

with open(df_path,'r') as f:
    jdata = f.read()

df = pd.DataFrame.from_records([json.loads(line) for line in jdata.split('\n') if line])

df

Unnamed: 0,link,headline,category,short_description,authors,date
0,https://www.huffpost.com/entry/covid-boosters-...,Over 4 Million Americans Roll Up Sleeves For O...,U.S. NEWS,Health experts said it is too early to predict...,"Carla K. Johnson, AP",2022-09-23
1,https://www.huffpost.com/entry/american-airlin...,"American Airlines Flyer Charged, Banned For Li...",U.S. NEWS,He was subdued by passengers and crew when he ...,Mary Papenfuss,2022-09-23
2,https://www.huffpost.com/entry/funniest-tweets...,23 Of The Funniest Tweets About Cats And Dogs ...,COMEDY,"""Until you have a dog you don't understand wha...",Elyse Wanshel,2022-09-23
3,https://www.huffpost.com/entry/funniest-parent...,The Funniest Tweets From Parents This Week (Se...,PARENTING,"""Accidentally put grown-up toothpaste on my to...",Caroline Bologna,2022-09-23
4,https://www.huffpost.com/entry/amy-cooper-lose...,Woman Who Called Cops On Black Bird-Watcher Lo...,U.S. NEWS,Amy Cooper accused investment firm Franklin Te...,Nina Golgowski,2022-09-22
...,...,...,...,...,...,...
209522,https://www.huffingtonpost.com/entry/rim-ceo-t...,RIM CEO Thorsten Heins' 'Significant' Plans Fo...,TECH,Verizon Wireless and AT&T are already promotin...,"Reuters, Reuters",2012-01-28
209523,https://www.huffingtonpost.com/entry/maria-sha...,Maria Sharapova Stunned By Victoria Azarenka I...,SPORTS,"Afterward, Azarenka, more effusive with the pr...",,2012-01-28
209524,https://www.huffingtonpost.com/entry/super-bow...,"Giants Over Patriots, Jets Over Colts Among M...",SPORTS,"Leading up to Super Bowl XLVI, the most talked...",,2012-01-28
209525,https://www.huffingtonpost.com/entry/aldon-smi...,Aldon Smith Arrested: 49ers Linebacker Busted ...,SPORTS,CORRECTION: An earlier version of this story i...,,2012-01-28


In [6]:
df.dtypes

link                 object
headline             object
category             object
short_description    object
authors              object
date                 object
dtype: object

We keep only the news published before 2018 and create a new column that concatenates the headline and the short description

In [7]:
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
# .copy() gives an independent dataframe, so adding a column below is unambiguous
df = df[df['date'] < '2018-01-01'].copy()

df['news'] = df['headline'] + ' ' + df['short_description']
df = df[['news', 'category']]

df



Unnamed: 0,news,category
17257,Second Ship Suspected Of Providing Oil To Nort...,WORLD NEWS
17258,Iran Protests: Civil Rights Movement Or Revolu...,WORLD NEWS
17259,Iran's Protesters Defy Crackdown Warning As Pr...,WORLD NEWS
17260,New York Family Of 5 Among 12 Killed In Costa ...,WORLD NEWS
17261,Lessons From This Year's Open Enrollment Seaso...,POLITICS
...,...,...
209522,RIM CEO Thorsten Heins' 'Significant' Plans Fo...,TECH
209523,Maria Sharapova Stunned By Victoria Azarenka I...,SPORTS
209524,"Giants Over Patriots, Jets Over Colts Among M...",SPORTS
209525,Aldon Smith Arrested: 49ers Linebacker Busted ...,SPORTS
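
The filtering step can be sketched on toy data as follows. The rows are made up for illustration. The sketch takes an explicit `.copy()` of the filtered frame, so the new `news` column is written to an independent dataframe and the original one stays untouched.

```python
import pandas as pd

# Toy stand-in for the news dataframe (values are made up for illustration)
toy = pd.DataFrame({
    'headline': ['Old story', 'Recent story', 'Older story'],
    'short_description': ['from 2016.', 'from 2020.', 'from 2012.'],
    'category': ['TECH', 'SPORTS', 'TECH'],
    'date': ['2016-05-01', '2020-03-15', '2012-01-28'],
})
toy['date'] = pd.to_datetime(toy['date'], format='%Y-%m-%d')

# Keep rows dated before 2018; .copy() makes the result independent of `toy`
toy_old = toy[toy['date'] < '2018-01-01'].copy()

# Concatenate headline and description into a single text column
toy_old['news'] = toy_old['headline'] + ' ' + toy_old['short_description']
toy_old = toy_old[['news', 'category']]

print(toy_old)
```

The row dated 2020 is dropped, and the remaining rows keep their original index labels, just as in the output above.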


Here we take a look at the distribution of news by their categories

In [8]:
df['category'].value_counts()

POLITICS          29672
WELLNESS          17827
ENTERTAINMENT     14338
TRAVEL             9815
STYLE & BEAUTY     9649
PARENTING          8677
HEALTHY LIVING     6679
FOOD & DRINK       6226
QUEER VOICES       5863
BUSINESS           5851
COMEDY             4732
SPORTS             4520
HOME & LIVING      4195
BLACK VOICES       4120
PARENTS            3919
THE WORLDPOST      3664
WEDDINGS           3651
DIVORCE            3426
IMPACT             3382
WOMEN              3245
CRIME              3231
GREEN              2593
WORLDPOST          2579
MEDIA              2522
RELIGION           2491
WEIRD NEWS         2464
STYLE              2220
SCIENCE            2138
TASTE              2087
TECH               2027
MONEY              1707
WORLD NEWS         1612
ARTS               1509
FIFTY              1401
GOOD NEWS          1398
ARTS & CULTURE     1326
ENVIRONMENT        1323
COLLEGE            1143
LATINO VOICES      1046
CULTURE & ARTS     1030
EDUCATION           972
Name: category, 

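We focus on two pairs of categories: *WELLNESS* with *HEALTHY LIVING* and *POLITICS* with *TRAVEL*. To balance each pair, we downsample the larger category to the size of the smaller one (6679 and 9815 news items, respectively) and store each pair in its own dataframe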

In [9]:
df = df[df['category'].isin(['POLITICS', 'WELLNESS', 'TRAVEL', 'HEALTHY LIVING'])]

df_wellness = df[df['category'] == 'WELLNESS'].sample(n=6679, random_state=369)
df_health = df[df['category'] == 'HEALTHY LIVING']
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df_wellness_health = pd.concat([df_wellness, df_health])

df_politics = df[df['category'] == 'POLITICS'].sample(n=9815, random_state=369)
df_travel = df[df['category'] == 'TRAVEL']
df_politics_travel = pd.concat([df_politics, df_travel])
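
The sample sizes in the cell above are the counts of the smaller category in each pair. A minimal sketch of a helper that derives them instead of hard-coding them is shown below. `balanced_pair` is a hypothetical name introduced here, and the toy dataframe is made up for illustration.

```python
import pandas as pd


def balanced_pair(df, cat_a, cat_b, seed=369):
    """Return the rows of two categories, with the larger one downsampled
    to the size of the smaller one. Hypothetical helper for illustration."""
    df_a = df[df['category'] == cat_a]
    df_b = df[df['category'] == cat_b]
    n = min(len(df_a), len(df_b))

    # Only the larger category is sampled; the smaller one is kept as is
    if len(df_a) > n:
        df_a = df_a.sample(n=n, random_state=seed)
    if len(df_b) > n:
        df_b = df_b.sample(n=n, random_state=seed)

    return pd.concat([df_a, df_b])


# Toy data: five items in category A, two in category B
toy = pd.DataFrame({
    'news': [f'item {i}' for i in range(7)],
    'category': ['A'] * 5 + ['B'] * 2,
})
toy_pair = balanced_pair(toy, 'A', 'B')
print(toy_pair['category'].value_counts())
```

On the real data, `balanced_pair(df, 'WELLNESS', 'HEALTHY LIVING')` would play the role of the first three lines of the cell above.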



Here we sample 300 datapoints from each dataframe, to later visualize their embeddings

In [10]:
df_wellness_health_sample = df_wellness_health.sample(n=300, random_state=369)
df_politics_travel_sample = df_politics_travel.sample(n=300, random_state=369)

We load the *all-MiniLM-L6-v2* model from the *sentence-transformers* package and compute a sentence embedding for each news item in the two samples

In [11]:
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

df_wellness_health_sample['embeddings'] = model.encode(df_wellness_health_sample['news'].tolist(), batch_size=100, device='cpu',show_progress_bar=True).tolist()
df_politics_travel_sample['embeddings'] = model.encode(df_politics_travel_sample['news'].tolist(), batch_size=100, device='cpu',show_progress_bar=True).tolist()


This is a utility function that reduces the embedding vectors to three dimensions and visualizes them in a 3D scatter plot

In [12]:
def plot_3d(dataframe, reduction='T-SNE'):

    # Perform dimensionality reduction
    if reduction == 'PCA':
        reducer = PCA(n_components=3)
    elif reduction == 'T-SNE':
        reducer = TSNE(n_components=3)
    else:
        raise ValueError('Invalid dimensionality reduction method. Use "PCA" or "T-SNE".')

    # Extract embeddings and apply reduction
    embeddings = np.stack(dataframe['embeddings'].values)
    reduced_embeddings = reducer.fit_transform(embeddings)

    # Add reduced embeddings to dataframe
    dataframe['x'] = reduced_embeddings[:, 0]
    dataframe['y'] = reduced_embeddings[:, 1]
    dataframe['z'] = reduced_embeddings[:, 2]

    # Create the 3D plot with reduced dot size and transparency
    categories = dataframe['category'].unique()
    data = []
    for category, color in zip(categories, ['red', 'blue']):
        df_filtered = dataframe[dataframe['category'] == category]
        trace = go.Scatter3d(x=df_filtered['x'], y=df_filtered['y'], z=df_filtered['z'],
                             mode='markers',
                             name=category,
                             marker=dict(size=5, opacity=0.5, color=color))
        data.append(trace)

    layout = go.Layout(scene=dict(xaxis_title='X', yaxis_title='Y', zaxis_title='Z'),
                       width=800, height=600)
    fig = go.Figure(data=data, layout=layout)
    fig.show()
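
T-SNE is a stochastic method, so two calls to the function above can produce different layouts for the same data. The sketch below shows one way to make the reduction step reproducible by fixing `random_state`. The random vectors are only a stand-in for real sentence embeddings (384 is the output dimension of *all-MiniLM-L6-v2*), and the `perplexity` value is an arbitrary choice for this toy example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(369)

# Random stand-in for sentence embeddings: 60 vectors of dimension 384
toy_embeddings = rng.normal(size=(60, 384))

# PCA is a linear projection; the seed only matters for randomized solvers
pca = PCA(n_components=3, random_state=369)
pca_points = pca.fit_transform(toy_embeddings)

# T-SNE needs a perplexity smaller than the number of samples
tsne = TSNE(n_components=3, perplexity=15, random_state=369)
tsne_points = tsne.fit_transform(toy_embeddings)

print(pca_points.shape, tsne_points.shape)
print('Variance kept by 3 components:', pca.explained_variance_ratio_.sum())
```

The explained variance ratio is a quick indication of how much information a PCA plot discards. T-SNE has no equivalent figure, which is one reason to look at both reductions.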

Here we visualize the sentence embeddings of travel and politics news. The pre-trained model separates the two categories well in the embedding space

In [13]:
plot_3d(df_politics_travel_sample)

The pre-trained model does not, however, separate wellness and healthy living news in the same way

In [14]:
plot_3d(df_wellness_health_sample)
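
The visual impression from the two plots can be complemented with a number. One option, sketched below, is the silhouette score with cosine distance: values close to 1 indicate well-separated groups and values close to 0 indicate overlapping ones. The sketch uses synthetic vectors, not the news embeddings, and `toy_groups` is a hypothetical helper introduced for the illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_score


def toy_groups(offset, n=150, dim=16, seed=369):
    """Two groups of random vectors centered at +offset and -offset on every axis."""
    rng = np.random.default_rng(seed)
    group_a = rng.normal(loc=offset, size=(n, dim))
    group_b = rng.normal(loc=-offset, size=(n, dim))
    vectors = np.vstack([group_a, group_b])
    labels = np.array(['a'] * n + ['b'] * n)
    return vectors, labels


# Groups that point in opposite directions versus groups with the same center
separated_vectors, separated_labels = toy_groups(offset=2.0)
overlapping_vectors, overlapping_labels = toy_groups(offset=0.0)

separated_score = silhouette_score(separated_vectors, separated_labels, metric='cosine')
overlapping_score = silhouette_score(overlapping_vectors, overlapping_labels, metric='cosine')

print('separated:', separated_score)
print('overlapping:', overlapping_score)
```

With the real samples, the same call would take `np.stack(df_politics_travel_sample['embeddings'].values)` as vectors and the `category` column as labels.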

Now we create training, validation, and test sets from the dataframe containing wellness and healthy living news. We will use them to fine-tune the model so that it learns better sentence embeddings for these types of news.

In [15]:
df_sample_train, df_sample_test = train_test_split(df_wellness_health, test_size=0.05, random_state=369)
df_sample_train, df_sample_val = train_test_split(df_sample_train, test_size=0.05, random_state=369)

print(df_sample_train.shape, df_sample_val.shape, df_sample_test.shape)

(12055, 2) (635, 2) (668, 2)
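
The shapes above follow from applying a 5% split twice. The arithmetic can be sketched as below, assuming the held-out share is rounded up to a whole number of rows, which reproduces the printed sizes. `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_kept, n_held_out) for a fractional split, rounding the held-out part up."""
    n_held_out = math.ceil(n_samples * test_size)
    return n_samples - n_held_out, n_held_out


n_total = 2 * 6679                                   # wellness + healthy living
n_rest, n_test = split_sizes(n_total, 0.05)          # first split: test set
n_train, n_val = split_sizes(n_rest, 0.05)           # second split: validation set

print(n_train, n_val, n_test)
```

Because the second split is taken from the remaining 95%, the validation set is about 4.75% of the full data rather than 5%.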


If running on Google Colab, we save the dataframes as CSV files in Google Drive. Otherwise, we just save them in the current folder

In [16]:
if 'google.colab' in str(get_ipython()):
  from google.colab import drive
  drive.mount('/content/gdrive')
  csv_path = '/content/gdrive/My Drive/Colab Data'
else:
  csv_path = '.'

df_sample_train.to_csv(os.path.join(csv_path, 'df_sample_train.csv'), index=False)
df_sample_val.to_csv(os.path.join(csv_path, 'df_sample_val.csv'), index=False)
df_sample_test.to_csv(os.path.join(csv_path, 'df_sample_test.csv'), index=False)

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
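
Since the news text contains commas and quotation marks, a quick round-trip check on the saved files can be reassuring. The following is a minimal sketch on a made-up dataframe written to a temporary folder; the file name `toy_split.csv` is arbitrary.

```python
import os
import tempfile

import pandas as pd

# Toy rows with a comma and quotation marks, as found in real news text
toy = pd.DataFrame({
    'news': ['Headline one. Description, with a comma.', 'Headline two "quoted".'],
    'category': ['WELLNESS', 'HEALTHY LIVING'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_split.csv')
    toy.to_csv(path, index=False)   # same call pattern as above
    restored = pd.read_csv(path)

# Compare the text and labels that were written with what was read back
round_trip_ok = (restored['news'].tolist() == toy['news'].tolist()
                 and restored['category'].tolist() == toy['category'].tolist())
print('Round trip preserved the data:', round_trip_ok)
```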
