# Tamil Wikipedia Text Analysis
## Introduction
Tamil, one of the world’s oldest languages, is rich in vocabulary and linguistic complexity. In Natural Language Processing (NLP), however, it remains a relatively low-resource language compared with languages like English. This notebook contributes one small resource toward closing that gap: a word frequency count over a dataset of Tamil Wikipedia articles. Such a list can serve as a starting point for NLP tasks including language modeling, vocabulary analysis, and text mining.
### Project Overview
This notebook processes the Tamil Wikipedia dataset by counting the occurrences of each unique word across the entire collection of articles. With a comprehensive frequency list of Tamil words, this project provides insights into commonly used vocabulary and word distributions, which are essential for understanding and modeling the language. This word frequency dataset can be further utilized to support future projects, such as developing Tamil language models or improving tools for sentiment analysis, machine translation, and other language-based applications.
### Objectives
- **Load and Process Data**: Read and process individual text files containing Tamil Wikipedia articles, extracting individual words.
- **Frequency Analysis**: Count occurrences of each unique word across all articles, producing a sorted list of the most frequently used words.
- **Dataset for NLP**: Create a word frequency dataset that can be reused in future Tamil NLP work, helping to address the current shortage of resources for the language.
### Relevance and Future Impact
The resources generated from this analysis are a first step towards building tools and datasets that support Tamil as a high-resource language in NLP. A word frequency list is useful for tasks such as language modeling, dictionary creation, and vocabulary expansion. By understanding word distributions, researchers and developers can better adapt NLP models to process Tamil text. Future steps could involve expanding the dataset, filtering out stopwords, and adding visualizations to deepen the analysis.
With this notebook, we take a small yet significant step towards empowering Tamil language processing, hoping to inspire further contributions from the NLP community.

In [1]:
# Import Libraries and Set Up Timer
import os
import pandas as pd
from collections import Counter
import re
import time

# Start timer for the entire notebook
notebook_start_time = time.time()
print("Libraries imported.")

Libraries imported.


In [2]:
# Set the input directory and output CSV paths (Kaggle dataset layout)
input_directory = "/kaggle/input/tamil-tamizh-wikipedia-articles/Tamil Wikipedia Text Articles/Tamil Wikipedia Text Articles"
output_csv = "/kaggle/working/tamil_word_counts.csv"

# Print paths for reference
print(f"Input directory: {input_directory}")
print(f"Output CSV path: {output_csv}")

Input directory: /kaggle/input/tamil-tamizh-wikipedia-articles/Tamil Wikipedia Text Articles/Tamil Wikipedia Text Articles
Output CSV path: /kaggle/working/tamil_word_counts.csv


In [3]:
def count_words_in_text(text):
    # Match runs of consecutive Tamil-script characters and treat each run as a word
    # \u0B80-\u0BFF is the Tamil Unicode block (letters, vowel signs, Tamil digits, and symbols)
    words = re.findall(r'[\u0B80-\u0BFF]+', text)
    return Counter(words)

print("Word counting function defined for Tamil text.")

Word counting function defined for Tamil text.
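
To see what the pattern actually captures, here is a minimal sketch on an invented sentence (the sample text is an assumption, not taken from the dataset). It reuses the same `[\u0B80-\u0BFF]+` pattern, so Latin text, ASCII digits, and punctuation act as separators and are dropped.

```python
import re
from collections import Counter

def count_words_in_text(text):
    # Same pattern as the notebook: one or more characters from the Tamil block
    return Counter(re.findall(r'[\u0B80-\u0BFF]+', text))

# Hypothetical sample mixing Tamil, Latin text, and ASCII digits
sample = "தமிழ் ஒரு மொழி. Tamil is a language. தமிழ் 2024"
counts = count_words_in_text(sample)

# Only the Tamil runs survive; the repeated word is counted twice
print(counts.most_common(1))
print(sum(counts.values()))
```

Because the match is purely script-based, inflected forms of the same stem are counted as separate words, and Tamil digits would be counted as words as well.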


In [4]:
total_word_count = Counter()
article_count = 0
print("Total word counter and article counter initialized.")

Total word counter and article counter initialized.


In [5]:
# Start timing for word and article counting
start_time = time.time()

# Regex pattern to match <doc> tags in the text files
doc_tag_pattern = re.compile(r'<doc id="[^"]*" title="[^"]*">')

# Traverse the AA and AB directories to process .txt files
for subdir in ["AA", "AB"]:
    subdir_path = os.path.join(input_directory, subdir)
    for file in os.listdir(subdir_path):
        if file.endswith('.txt'):
            file_path = os.path.join(subdir_path, file)
            with open(file_path, 'r', encoding='utf-8') as f:
                file_content = f.read()
                
                # Find all <doc> sections to identify articles
                articles = doc_tag_pattern.split(file_content)[1:]  # Split and remove any empty initial part

                # Count words in each article
                for article in articles:
                    total_word_count.update(count_words_in_text(article))
                
                # Update article count and progress tracking
                article_count += len(articles)
            print(f"Processed {file_path} with {len(articles)} articles.")

# End timing and print time taken for word counting and total articles processed
end_time = time.time()
print(f"Word counting completed in {end_time - start_time:.2f} seconds.")
print(f"Total articles processed: {article_count}")

Processed /kaggle/input/tamil-tamizh-wikipedia-articles/Tamil Wikipedia Text Articles/Tamil Wikipedia Text Articles/AA/wiki_80.txt with 2462 articles.
Processed /kaggle/input/tamil-tamizh-wikipedia-articles/Tamil Wikipedia Text Articles/Tamil Wikipedia Text Articles/AA/wiki_82.txt with 5573 articles.
Processed /kaggle/input/tamil-tamizh-wikipedia-articles/Tamil Wikipedia Text Articles/Tamil Wikipedia Text Articles/AA/wiki_01.txt with 1618 articles.
Processed /kaggle/input/tamil-tamizh-wikipedia-articles/Tamil Wikipedia Text Articles/Tamil Wikipedia Text Articles/AA/wiki_38.txt with 6252 articles.
Processed /kaggle/input/tamil-tamizh-wikipedia-articles/Tamil Wikipedia Text Articles/Tamil Wikipedia Text Articles/AA/wiki_95.txt with 3648 articles.
Processed /kaggle/input/tamil-tamizh-wikipedia-articles/Tamil Wikipedia Text Articles/Tamil Wikipedia Text Articles/AA/wiki_96.txt with 4354 articles.
Processed /kaggle/input/tamil-tamizh-wikipedia-articles/Tamil Wikipedia Text Articles/Tamil Wi
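
The article split in the cell above can be checked on a small string. The sketch below uses an invented two-article sample; the layout (an opening `<doc ...>` tag, the article text, then a closing `</doc>`) is an assumption inferred from the pattern, not verified against the dataset files.

```python
import re

# Same pattern as the notebook cell above
doc_tag_pattern = re.compile(r'<doc id="[^"]*" title="[^"]*">')

# Invented two-article sample; the real file layout is assumed, not verified
sample = (
    '<doc id="1" title="முதல்">\nமுதல் கட்டுரை\n</doc>\n'
    '<doc id="2" title="இரண்டாம்">\nஇரண்டாம் கட்டுரை\n</doc>\n'
)

parts = doc_tag_pattern.split(sample)
articles = parts[1:]  # parts[0] is whatever precedes the first tag (empty here)

print(len(articles))
print(repr(parts[0]))
```

The closing `</doc>` stays in each chunk but contains no Tamil characters, so it does not affect the counts. Text that appears only inside the opening tag, such as the `title` attribute, is discarded by the split. If a file's tags carried attributes other than `id` and `title`, this pattern would not match and the file would report zero articles.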

In [6]:
# Start timing for DataFrame creation and sorting
df_start_time = time.time()

# Convert the Counter to a DataFrame
word_counts_df = pd.DataFrame(total_word_count.items(), columns=['word', 'count'])

# Sort DataFrame by count in descending order
sorted_word_counts_df = word_counts_df.sort_values(by='count', ascending=False)

# End timing for DataFrame operations
df_end_time = time.time()
print(f"DataFrame operations completed in {df_end_time - df_start_time:.2f} seconds.")

DataFrame operations completed in 1.49 seconds.
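
The same descending order is also available directly from the `Counter` via `most_common()`, which makes it easy to ask how much of the corpus the top words account for. The sketch below is one possible way to do that; the `coverage` helper and the toy counts are hypothetical and stand in for `total_word_count`.

```python
from collections import Counter

def coverage(counter, top_n):
    """Fraction of all tokens accounted for by the top_n most frequent words."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    top_counts = [count for _, count in counter.most_common(top_n)]
    return sum(top_counts) / total

# Toy counts standing in for total_word_count
toy = Counter({"மற்றும்": 5, "ஒரு": 3, "இந்த": 2})

print(coverage(toy, 1))  # share of tokens covered by the single most frequent word
print(coverage(toy, 2))  # share covered by the top two words
```

Applied to the full counter, a curve of `coverage` against `top_n` gives a compact view of how skewed the word distribution is.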


In [7]:
word_counts_df

index,word,count
0,முக்கிய,42306
1,அறிவிப்பு,12470
2,இக்கட்டுரை,11228
3,தமிழ்நாடு,156983
4,அரசுத்,12235
...,...,...
2033590,அமாபாத்,1
2033591,ஜாப்ராபாத்தின்,1
2033592,தானேதார்கள்,1
2033593,ஜாஞ்சிராவின்,1


In [8]:
sorted_word_counts_df

index,word,count
78,மற்றும்,408644
165,பகுப்பு,335572
2107,ஒரு,279287
102,ஆம்,205873
24,இந்த,185110
...,...,...
1037225,அர்வத்தைத்,1
1037224,ஊட்டியதாக்க்,1
1037223,மிச்சிகானுக்குத்,1
1037222,உரோமனும்,1


In [9]:
# Save the sorted DataFrame to CSV
sorted_word_counts_df.to_csv(output_csv, index=False)
print(f"Word counts saved to {output_csv}")

Word counts saved to /kaggle/working/tamil_word_counts.csv
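
For downstream use, the saved file can be read back with the standard library alone. The following is a minimal round-trip sketch: it writes a toy file with the same `word,count` header to a temporary directory (the rows and the temporary path are illustrative, not the real output) and reads it back with an explicit UTF-8 encoding, which matters for Tamil text on systems whose default encoding is not UTF-8.

```python
import csv
import os
import tempfile

# Toy rows standing in for the sorted word counts
rows = [("மற்றும்", 5), ("ஒரு", 3)]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tamil_word_counts.csv")

    # Write with the same header the notebook uses: word,count
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["word", "count"])
        writer.writerows(rows)

    # Read back, converting the count column from str to int
    with open(path, "r", encoding="utf-8", newline="") as f:
        loaded = [(r["word"], int(r["count"])) for r in csv.DictReader(f)]

print(loaded == rows)
```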


In [10]:
# Calculate total time taken for the notebook
notebook_end_time = time.time()
print(f"Total notebook runtime: {notebook_end_time - notebook_start_time:.2f} seconds")

Total notebook runtime: 74.94 seconds


## Conclusion
In this notebook, we processed the Tamil Wikipedia articles and produced a word frequency count covering more than two million distinct word forms. The resulting list, saved as `tamil_word_counts.csv`, records how often each word appears across the articles and can support applications ranging from language modeling to vocabulary expansion for Tamil NLP.
The list is raw: it has not been filtered or normalized, so function words such as மற்றும் dominate the top ranks. It is best treated as a foundational dataset for further text analysis and resource building rather than a finished vocabulary.
### Next Steps
- **Data Cleaning**: Filter out stopwords or commonly used function words to focus on meaningful vocabulary.
- **Visualization**: Use word clouds or other visual tools to represent word distributions.
- **Language Modeling**: Leverage the word frequencies for language model training to improve NLP capabilities for Tamil.
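
The data-cleaning step could be sketched as follows. The stopword set here is illustrative only: it is simply the five most frequent tokens from the sorted output in this notebook, not a vetted Tamil stopword list, and the `remove_stopwords` helper is hypothetical. One of those tokens, பகுப்பு, is the Tamil Wikipedia label for "Category", so its high rank likely reflects category markup rather than ordinary prose.

```python
from collections import Counter

# Illustrative only: the five most frequent tokens from the sorted output above,
# not a vetted Tamil stopword list
stopwords = {"மற்றும்", "பகுப்பு", "ஒரு", "ஆம்", "இந்த"}

def remove_stopwords(counter, stopwords):
    """Return a new Counter without the given words."""
    return Counter({w: c for w, c in counter.items() if w not in stopwords})

# Toy counter; the counts are copied from the notebook output
toy = Counter({"மற்றும்": 408644, "தமிழ்நாடு": 156983, "முக்கிய": 42306})
filtered = remove_stopwords(toy, stopwords)

print(filtered.most_common())
```

Returning a new `Counter` leaves the original counts intact, so filtered and unfiltered lists can be compared side by side.
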
The journey to make Tamil a high-resource language for NLP is ongoing, and this word frequency dataset contributes an essential piece to that goal.