This notebook preprocesses a dataset of English-language news articles collected from several Nepali news websites.

Install the required library.

In [1]:
%pip install nltk

Note: you may need to restart the kernel to use updated packages.


Download NLTK's English stopword list and the WordNet corpus used for lemmatization.

In [2]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')


[nltk_data] Downloading package stopwords to /home/riwaj/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/riwaj/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

Import the utils module, which provides the preprocessing functions, and the standard csv module.

In [3]:
import utils
import csv

Set the file paths for input and output.

In [4]:
input_file = 'dataset.csv'
output_file = 'processed_dataset.csv'

Preview the raw dataset's content.

In [5]:
with open(input_file, 'r', newline='', encoding='utf-8') as file:
    rows = list(csv.DictReader(file))

# Print the 'Content' field of a few sample rows.
for index in (22, 33, 44):
    print(rows[index].get('Content', ''))

KATHMANDU, MARCH 13 Prime Minister KP Sharma Oli has extended warm greetings and best wishes on the occasion of Phagu Poornima or Holi to all Nepalis living at home and abroad. Taking to his social media account, he has expressed his best wishes to all Nepali brothers and sisters, calling the festival of joy and enthusiasm also as Basantaotsav and Madanotsav. Stating that Phagu Poornima is celebrated during full moon of the spring season and considered the day of the renaissance of nature, Prime Minister Oli has said in his facebook page, "Best wishes to everyone on the occasion of Phagu Poornima, the festival of joy and enthusiasm. Happy Holi!"
KATHMANDU, FEBRUARY 20 Five separate reports were presented on Tuesday at the Nepali Congress (NC) Mahasamiti Meeting, which commenced on Monday in Godavari, Lalitpur. According to the NC, party Vice President and Deputy Prime Minister Purna Bahadur Khadka shared the Policy Report, while General Secretary Gagan Thapa tabled the Organizational P
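
The preview above loads every row into a list just to print three of them. For a larger file, a streaming variant may be preferable. The following is a minimal sketch, not part of the original notebook: `preview_rows` is a hypothetical helper name, and it assumes the same `Content` column used above.

```python
import csv
from itertools import islice


def preview_rows(path, indices, column="Content"):
    """Return {index: value} for the given 0-based row indices of a CSV file."""
    wanted = set(indices)
    if not wanted:
        return {}
    found = {}
    with open(path, "r", newline="", encoding="utf-8") as file:
        reader = csv.DictReader(file)
        # Stop reading once the largest requested index has been passed.
        for index, row in enumerate(islice(reader, max(wanted) + 1)):
            if index in wanted:
                found[index] = row.get(column, "")
    return found
```

A call such as `preview_rows(input_file, [22, 33, 44])` would return the same three articles. An index past the end of the file is simply missing from the result instead of raising an `IndexError`.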

Run the preprocessing pipeline on the dataset.

In [6]:
with open(input_file, 'r', newline='', encoding='utf-8') as raw_data, \
        open(output_file, 'w', newline='', encoding='utf-8') as outfile:
    reader = csv.DictReader(raw_data)
    # Keep the original columns and append one for the processed text.
    fieldnames = list(reader.fieldnames or []) + ['Processed']
    writer = csv.DictWriter(outfile, fieldnames=fieldnames)
    writer.writeheader()

    for row in reader:
        content = row.get('Content', '')
        # Clean the raw text, then split it into tokens.
        text = utils.clean_text(content)
        tokens = utils.tokenize(text)
        # Drop stopwords, then lemmatize and stem the remaining tokens.
        tokens = utils.remove_stopwords(tokens)
        tokens = utils.lemmatize(tokens)
        tokens = utils.porter_stem(tokens)

        # Store the tokens as a single space-separated string.
        row['Processed'] = " ".join(tokens)
        writer.writerow(row)

print(f"Output saved to {output_file}")

Output saved to processed_dataset.csv
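
The `utils` module itself is not shown in this notebook. Judging from the processed output, the cleaning step appears to lowercase the text and drop digits and punctuation. The sketch below illustrates what the first three stages might look like under that assumption, using only the standard library. The function names mirror the calls above, but the bodies and the tiny `STOPWORDS` set are assumptions, not the actual `utils` implementation. Lemmatization and stemming are left out because they depend on the downloaded NLTK data.

```python
import re

# Tiny illustrative stopword set (an assumption); NLTK's English list is far longer.
STOPWORDS = {"a", "an", "and", "at", "has", "of", "on", "or", "the", "to"}


def clean_text(text):
    """Lowercase the text and keep only ASCII letters and single spaces."""
    lowered = text.lower()
    letters_only = re.sub(r"[^a-z\s]", " ", lowered)
    return re.sub(r"\s+", " ", letters_only).strip()


def tokenize(text):
    """Split already-cleaned text on whitespace."""
    return text.split()


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop every token that appears in the stopword set."""
    return [token for token in tokens if token not in stopwords]


sample = "KATHMANDU, MARCH 13 Prime Minister KP Sharma Oli has extended warm greetings"
tokens = remove_stopwords(tokenize(clean_text(sample)))
print(tokens)
```

Removing stopwords after lowercasing matters here: a lowercase stopword set would not match capitalized tokens such as "The" if the order were reversed.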


Preview the preprocessed content.

In [7]:
with open(output_file, 'r', newline='', encoding='utf-8') as file:
    rows = list(csv.DictReader(file))

# Print the 'Processed' field of the same sample rows.
for index in (22, 33, 44):
    print(rows[index].get('Processed', ''))

kathmandu march prime minister kp sharma oli extend warm greet best wish occasion phagu poornima holi nepali liv home abroad tak social medium account express best wish nepali brother sister call festival joy enthusiasm also basantaotsav madanotsav stat phagu poornima celebrat full moon spr season consider day renaissance nature prime minister oli said facebook page best wish everyone occasion phagu poornima festival joy enthusiasm happy holi
kathmandu february five separate report present tuesday nepali congres nc mahasamiti meet commenc monday godavari lalitpur accord nc party vice president deputy prime minister purna bahadur khadka shar policy report general secretary gagan thapa tabl organizational proposal likewise another general secretary bishow prakash sharma shar contemporary political report nc spokesperson finance minister dr prakash sharan mahat present report nepal present economic statu possibility future course similar account committee coordinator shyam kumar ghimire t