In [1]:
import pandas as pd
import langdetect
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('punkt')

def detect_language(text):
    try:
        return langdetect.detect(text)
    except Exception:  # e.g. empty text or a non-string value
        return "Unknown"

def word_count(text):
    return len(word_tokenize(text))

def sentence_count(text):
    return len(sent_tokenize(text))

def process_headlines(dataset_path):
    df = pd.read_csv(dataset_path)
    df['Language'] = df['headline'].apply(detect_language)
    df['Word Count'] = df['headline'].apply(word_count)
    df['Sentence Count'] = df['headline'].apply(sentence_count)
    df['Tokens'] = df['headline'].apply(word_tokenize)
    print(df[['headline', 'Language', 'Word Count', 'Sentence Count', 'Tokens']].head())

dataset_path = "NewsCategorizer.csv"
process_headlines(dataset_path)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


                                            headline Language  Word Count  \
0              143 Miles in 35 Days: Lessons Learned       fr           8   
1       Talking to Yourself: Crazy or Crazy Helpful?       en           9   
2  Crenezumab: Trial Will Gauge Whether Alzheimer...       en          15   
3                     Oh, What a Difference She Made       en           7   
4                                   Green Superfoods       af           2   

   Sentence Count                                             Tokens  
0               1    [143, Miles, in, 35, Days, :, Lessons, Learned]  
1               1  [Talking, to, Yourself, :, Crazy, or, Crazy, H...  
2               1  [Crenezumab, :, Trial, Will, Gauge, Whether, A...  
3               1            [Oh, ,, What, a, Difference, She, Made]  
4               1                                [Green, Superfoods]  


🔹 1. Importing Required Libraries

import pandas as pd

    Why: pandas is used for loading and manipulating structured data (like CSV).

    Use: We’ll use it to read the dataset and process each row (headline).

import langdetect

    Why: langdetect is a library used to detect the language of a given text.

    Use: We'll use this to identify the language of each headline.

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

    Why: nltk is a popular library for Natural Language Processing (NLP).

        word_tokenize: splits text into words.

        sent_tokenize: splits text into sentences.

    Use: These are used to count words, count sentences, and tokenize text.

🔹 2. Download Required Tokenizer

nltk.download('punkt')

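    Why: Downloads the Punkt tokenizer models, which NLTK needs to split text into sentences and words.

    Use: Without this resource, word_tokenize and sent_tokenize raise a LookupError. Recent NLTK releases may ask for the punkt_tab resource instead; the error message names the resource to download.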

🔹 3. Define Helper Functions
📌 Language Detection

def detect_language(text):
    try:
        return langdetect.detect(text)
    except Exception:  # e.g. empty text or a non-string value
        return "Unknown"

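    What it does: Tries to detect the language of the input text.

    try-except block: Prevents the program from crashing if detection fails, for example on an empty string.

    Returns: A language code (like "en" for English), or "Unknown" if detection fails.

    Caveat: Detection is unreliable on very short texts. In the output above, "143 Miles in 35 Days: Lessons Learned" is tagged fr and "Green Superfoods" is tagged af, although both are English.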
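
    Short headlines give the detector little to work with, so one option is to skip detection below a length threshold. The sketch below is an illustration under assumptions, not part of the original script: safe_detect and min_chars are hypothetical names, and the detector is passed in as a parameter so the control flow can be tried without langdetect installed.

```python
def safe_detect(text, detector, min_chars=20):
    """Return a language code, or "Unknown" when detection is skipped or fails.

    detector  : any callable mapping str -> language code (e.g. langdetect.detect)
    min_chars : hypothetical threshold; tune it on your own data
    """
    # Non-string values (such as NaN from an empty CSV cell) cannot be detected.
    if not isinstance(text, str):
        return "Unknown"
    # Very short texts are skipped instead of guessed.
    if len(text.strip()) < min_chars:
        return "Unknown"
    try:
        return detector(text)
    except Exception:
        # Any detector failure falls back to the same sentinel value.
        return "Unknown"


# Stand-in detector used only to demonstrate the control flow.
def fake_detector(text):
    return "en"


print(safe_detect("Green Superfoods", fake_detector))
print(safe_detect("Talking to Yourself: Crazy or Crazy Helpful?", fake_detector))
```

    In the notebook this could be applied as df['headline'].apply(lambda t: safe_detect(t, langdetect.detect)). The langdetect README also documents setting DetectorFactory.seed = 0 (after from langdetect import DetectorFactory) to make results repeatable between runs.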

📌 Word Count

def word_count(text):
    return len(word_tokenize(text))

    What it does: Tokenizes the text into words and returns the number of tokens.

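    Use: This approximates the number of words in the text. Punctuation marks count as tokens, so "143 Miles in 35 Days: Lessons Learned" gets a Word Count of 8 because the colon is counted.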
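
    If punctuation should not be counted, the punctuation-only tokens can be filtered out before counting. A minimal sketch, using the token list printed for the first headline in the output above; count_words is a hypothetical helper, not part of the original script.

```python
def count_words(tokens):
    """Count tokens that contain at least one letter or digit."""
    return sum(1 for tok in tokens if any(ch.isalnum() for ch in tok))


tokens = ["143", "Miles", "in", "35", "Days", ":", "Lessons", "Learned"]
print(len(tokens))          # 8: all tokens, punctuation included
print(count_words(tokens))  # 7: the ":" token is skipped
```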

📌 Sentence Count

def sentence_count(text):
    return len(sent_tokenize(text))

    What it does: Tokenizes the text into sentences and returns the count.

    Use: Gives the number of sentences in the headline.

🔹 4. Process Dataset Function

def process_headlines(dataset_path):

    Defines a function to load the dataset and apply all processing steps.

📌 Load Dataset

    df = pd.read_csv(dataset_path)

    What it does: Reads the CSV file into a DataFrame df.

    Use: The dataset is assumed to have a column named "headline".
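
    Since every later step depends on that column, a short check after loading can fail early with a clear message. A sketch under assumptions: load_headlines is a hypothetical helper, and a tiny in-memory CSV with made-up rows stands in for NewsCategorizer.csv, whose full contents are not shown here.

```python
import io

import pandas as pd


def load_headlines(source, column="headline"):
    """Read a CSV and make sure the text column exists and has no missing values."""
    df = pd.read_csv(source)
    if column not in df.columns:
        raise KeyError(f"expected a '{column}' column, found {list(df.columns)}")
    # Empty cells are read as NaN; replace them so the tokenizers always get a string.
    df[column] = df[column].fillna("").astype(str)
    return df


# Illustrative data only: the second row has an empty headline cell.
sample_csv = io.StringIO("headline,label\nGreen Superfoods,a\n,b\n")
df = load_headlines(sample_csv)
print(df["headline"].tolist())
```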

📌 Apply NLP Functions to Each Headline

    df['Language'] = df['headline'].apply(detect_language)

    Detects language for each headline.

    Adds a new column Language to store the result.

    df['Word Count'] = df['headline'].apply(word_count)

    Counts words in each headline.

    Adds the count to a new column Word Count.

    df['Sentence Count'] = df['headline'].apply(sentence_count)

    Counts sentences in each headline.

    Saves result in the Sentence Count column.

    df['Tokens'] = df['headline'].apply(word_tokenize)

    Tokenizes each headline into words.

    Adds the list of words into a new column Tokens.
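
    The steps above tokenize every headline twice, once for Word Count and once for Tokens. One alternative is to tokenize once and derive the count from the stored list. In this sketch a simple regular expression stands in for word_tokenize so that it runs without the NLTK data; that stand-in is an assumption and does not reproduce NLTK's behavior in general.

```python
import re

import pandas as pd


def simple_tokenize(text):
    """Stand-in tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


df = pd.DataFrame({"headline": ["Oh, What a Difference She Made", "Green Superfoods"]})

# Tokenize once, then reuse the stored lists for the count.
df["Tokens"] = df["headline"].apply(simple_tokenize)
df["Word Count"] = df["Tokens"].apply(len)

print(df[["headline", "Word Count"]])
```

    In the notebook the same pattern would use word_tokenize in place of simple_tokenize.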

📌 Display Result

    print(df[['headline', 'Language', 'Word Count', 'Sentence Count', 'Tokens']].head())

    Displays the first 5 rows of selected columns.

    Useful as a quick check that every processing step ran and produced the expected columns.
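
    Beyond reading the first rows, a frequency table of the Language column is a quick way to spot unexpected codes. A minimal sketch, using only the five language codes printed in the output above rather than the full dataset:

```python
import pandas as pd

# Language codes copied from the five rows shown in the notebook output.
df = pd.DataFrame({"Language": ["fr", "en", "en", "en", "af"]})

# Count how often each detected code occurs, most frequent first.
counts = df["Language"].value_counts()
print(counts)
```

    On the full DataFrame the same call is df['Language'].value_counts(). If the dataset is expected to be in one language, rare codes are worth inspecting as possible misdetections.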

🔹 5. Set Dataset Path & Run the Function

dataset_path = "NewsCategorizer.csv"

    Sets the path to the dataset CSV file. A bare filename is resolved relative to the notebook's working directory; otherwise provide the full path.

process_headlines(dataset_path)

    Calls the function to run all analysis on the dataset.

✅ Summary for Viva:

    This script processes headlines from a CSV file.

    It applies language detection, word count, sentence count, and tokenization using langdetect and nltk.

    Results are stored as new columns in the DataFrame.

    Helps in preprocessing text for NLP tasks like classification, summarization, or clustering.

In [2]:
pip install langdetect

Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 981.5/981.5 kB 12.5 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: langdetect
  Building wheel for langdetect (setup.py) ... done
  Created wheel for langdetect: filename=langdetect-1.0.9-py3-none-any.whl size=993223 sha256=a44d0078792344fa4e9c085851f1ca7a174e621d79205974af053da17869c8f1
  Stored in directory: /root/.cache/pip/wheels/0a/f2/b2/e5ca405801e05eb7c8ed5b3b4bcf1fcabcd6272c167640072e
Successfully built langdetect
Installing collected packages: langdetect
Successfully installed langdetect-1.0.9
