# Text Classification: Dictionary Algorithm

## Table of Contents

1. [Introduction: Custom Text Classification for Enhanced Data Insights](#Introduction)
2. [Building the Algorithm](#Building-the-Algorithm)
3. [Example: Dictionary](#Example-Dictionary)
4. [Example: Individual Strings](#Example-Individual-Strings)
5. [Example: Data Frame](#Example-Data-Frame)

## Introduction: Custom Text Classification for Enhanced Data Insights <a id="Introduction"></a>

This Jupyter notebook shows how to extract key terms from text strings and assign them to custom classifications that fit a specific business context or use case. The result is a targeted, dictionary-based text classification method rather than a generic clustering approach.

### Shifting from Clustering to Custom Classification
Traditional text clustering, an unsupervised machine learning method, often groups text in ways that are not directly meaningful or applicable in a given business scenario. To counter this, we use a rule-based approach: the categories and their keywords are defined up front, so every label corresponds to a category chosen for our specific objectives. No model is trained; a text is classified by looking up its words and phrases in the dictionary.

### The Code: A Synergy of Text Processing and Custom Mapping
This notebook presents a comprehensive codebase that integrates several Python libraries and techniques:
- **Libraries and Preprocessing**: Utilizing libraries like `pandas` for data manipulation, `nltk` for natural language processing, and standard tools for text normalization and preprocessing.
- **Advanced Keyword Extraction**: Employing methods to extract not just individual words but also significant pairs (bigrams) and triples (trigrams) of words, adding depth to our analysis.
- **Creating a Custom Dictionary**: We define a dictionary with keys representing our bespoke categories (like 'Environment', 'Technology', 'Sports') and values that are specific terms or phrases.
- **Classification Algorithm**: Our custom function, `find_key_by_value`, processes text data to classify it under the most relevant category from our dictionary, based on the occurrence and relevance of keywords.
- **Practical Application on Data**: We apply this function to real data, both on individual text strings and on a DataFrame, demonstrating its practical utility in assigning relevant topics to text data.
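
To make the bigram and trigram idea concrete, here is a minimal sketch of how consecutive word groups can be generated from a token list. The helper name `ngrams` is hypothetical and is not part of the notebook's code.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["climate", "change", "threatens", "habitat"]

print(ngrams(tokens, 1))  # ['climate', 'change', 'threatens', 'habitat']
print(ngrams(tokens, 2))  # ['climate change', 'change threatens', 'threatens habitat']
print(ngrams(tokens, 3))  # ['climate change threatens', 'change threatens habitat']
```

When `n` exceeds the number of tokens, the range is empty and the function returns an empty list, so short texts need no special handling.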

### Objective and Conclusion
The goal of this notebook is to provide a framework that turns general text data into context-specific classifications. By the end, you will have a reusable text-analysis tool whose categories are tailored to specific business needs, so that the classified data can feed directly into decision-making.

## Building the Algorithm <a id="Building-the-Algorithm"></a>

### Essential Python Libraries for Dictionary Algorithm

Before exploring this Dictionary Algorithm notebook, let's import the necessary Python libraries that will be pivotal in building this algorithm:

- **`pandas`**: A foundational library in Python for data manipulation and analysis. It offers data structures and operations for manipulating numerical tables and time series, making it ideal for handling and analyzing large datasets, such as collections of text data.

- **`nltk`**: The Natural Language Toolkit, a comprehensive library in Python for processing and analyzing human language data. It includes libraries for breaking texts down into their constituent parts, tagging them, identifying semantic information, and categorizing them.

- **`nltk.corpus.stopwords`**: A module within NLTK specifically for accessing a collection of 'stop words'. Stop words are commonly used words (such as 'the', 'is', 'in') that are typically ignored in text processing and natural language understanding tasks because they carry minimal meaningful content.

- **`string`**: This standard Python library is essential for handling and manipulating string data. It provides capabilities such as basic string formatting, constants, and utility functions, which are particularly useful for tasks like punctuation removal in text processing.

- **`unicodedata`**: A Python module that provides access to the Unicode Character Database. In text processing, this is particularly useful for normalizing texts, ensuring consistent character representation across different text samples and encoding formats.

- **`collections.Counter`**: Part of the Python collections module, `Counter` is a subclass of dictionary that's used for counting hashable objects. It's an ideal tool for keeping track of word frequencies in a text, which is a common requirement in various text analysis tasks.

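Importing these libraries up front keeps all the tools for text processing in one place. The import cell also loads `unittest`, which drives the test cells below, and `math`. In the cells shown in this notebook, `string`, `unicodedata`, and `math` are imported but not called directly. Note that `find_key_by_value` relies on NLTK's tokenizer models and stop-word corpus; if they are not installed yet, download them once with `nltk.download('punkt')` and `nltk.download('stopwords')` (recent NLTK releases may ask for `punkt_tab` instead of `punkt`).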
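
As an illustration of what `unicodedata`, `string`, and `Counter` are typically used for in text preprocessing, here is a small sketch. The function name `normalize_text` and the sample sentence are assumptions made for this example, not part of the notebook's algorithm.

```python
import string
import unicodedata
from collections import Counter


def normalize_text(text):
    """Lowercase, strip accents, and remove ASCII punctuation (a sketch)."""
    # NFKD splits accented characters into a base character plus combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # A translation table with a deletion set drops every character in string.punctuation
    return without_accents.lower().translate(str.maketrans("", "", string.punctuation))


cleaned = normalize_text("Café culture, café prices!")
word_counts = Counter(cleaned.split())

print(cleaned)                     # cafe culture cafe prices
print(word_counts.most_common(1))  # [('cafe', 2)]
```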

In [1]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
import string
import unicodedata
from collections import Counter
import unittest
import math

# Show the full contents of text columns instead of truncating them
pd.set_option('display.max_colwidth', None)  # older pandas versions used -1 instead of None

### Enhanced Text Analysis for Keyword Extraction

This Python code comprises three functions: `find_max_sequence_length`, `calculate_weight`, and `find_key_by_value`. Together, they select the most relevant category for a text by matching its words and phrases against a predefined dictionary of keywords.

#### Function `find_max_sequence_length`

- **Purpose**: Determines the maximum number of words in any phrase within the values of a given dictionary.
- **Process**:
  - Iterates through each value in the dictionary.
  - Splits each phrase into words and counts the number.
  - Keeps track of the maximum word count found.
- **Usage**: Assists in dynamically determining the analysis depth for the input string in `find_key_by_value`.
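
The three steps listed above can also be written as a single expression. The sketch below uses a hypothetical helper name, `max_phrase_words`, and deliberately skips input validation.

```python
def max_phrase_words(dictionary):
    """Longest phrase length, in words, across all dictionary values (0 if there are none)."""
    return max(
        (len(phrase.split()) for phrases in dictionary.values() for phrase in phrases),
        default=0,
    )


topics = {"Environment": ["rainforest", "climate change"], "Sports": ["world cup final"]}

print(max_phrase_words(topics))  # 3
print(max_phrase_words({}))      # 0
```

The `default=0` argument covers the empty-dictionary case, where `max` would otherwise raise a `ValueError`.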

In [2]:
def find_max_sequence_length(dictionary):
    """
    Find the longest phrase, measured in words, among the dictionary values.

    Args:
    dictionary (dict): A dictionary where each value is a list of strings (phrases).

    Returns:
    int: The maximum number of words found in any single phrase in the dictionary values.

    Raises:
    TypeError: If the input is not a dictionary or if a dictionary value is not a list.
    ValueError: If any list in the dictionary contains an item that is not a string.
    """
    if not isinstance(dictionary, dict):
        raise TypeError("Input must be a dictionary.")

    max_length = 0
    for values in dictionary.values():
        # A bare string would otherwise be iterated character by character
        if not isinstance(values, list):
            raise TypeError("All values in the dictionary must be lists.")

        if not all(isinstance(phrase, str) for phrase in values):
            raise ValueError("All values in the dictionary must be lists of strings.")

        for phrase in values:
            length = len(phrase.split())
            max_length = max(max_length, length)

    return max_length

In [3]:
class TestFindMaxSequenceLength(unittest.TestCase):
    def test_empty_dictionary(self):
        """Test an empty dictionary."""
        self.assertEqual(find_max_sequence_length({}), 0)

    def test_single_word_values(self):
        """Test a dictionary with single-word values."""
        dictionary = {'key1': ['word', 'another'], 'key2': ['test']}
        self.assertEqual(find_max_sequence_length(dictionary), 1)

    def test_multi_word_values(self):
        """Test a dictionary with multi-word values."""
        dictionary = {'key1': ['one', 'two words'], 'key2': ['this is three', 'three word phrase']}
        self.assertEqual(find_max_sequence_length(dictionary), 3)

# Run the tests
unittest.main(argv=[''], exit=False)

...
----------------------------------------------------------------------
Ran 3 tests in 0.007s

OK


<unittest.main.TestProgram at 0x2290f39ca90>

#### Function `calculate_weight`

- **Purpose**: Provides a weighting mechanism for word combinations based on their length.
- **Process**:
  - Given the length of a word combination (e.g., single, pair, triple), calculates a weight.
  - Uses a linear formula with a multiplier (0.5 by default) to ensure longer combinations have more influence, but in a balanced manner.
- **Usage**: Used in `find_key_by_value` to weight matches according to their word count, giving preference to longer, more contextually rich matches.
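
Written out, the linear formula is `weight = 1 + (length - 1) * weight_factor`. The sketch below tabulates it for a few sequence lengths; the name `linear_weight` is hypothetical and only mirrors the formula, without the input validation.

```python
def linear_weight(length, weight_factor=0.5):
    """Weight of a match that is `length` words long."""
    return 1 + (length - 1) * weight_factor


# Compare the default factor with a steeper one
for length in range(1, 5):
    print(length, linear_weight(length), linear_weight(length, weight_factor=1.0))
```

With the default factor, a single word weighs 1, a pair 1.5, a triple 2, and a four-word phrase 2.5. Fractional weights such as 1.5 only affect the outcome if they are accumulated as floats; truncating them to integers would make a pair count the same as a single word.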

In [4]:
def calculate_weight(length, weight_factor=0.5):
    """
    Args:
    length (int): The length of the word sequence.
    weight_factor (float, optional): The multiplier used to calculate the weight. Default is 0.5.

    Returns:
    float: The calculated weight for the given length.

    Raises:
    TypeError: If 'length' is not an integer or 'weight_factor' is not a number (float or int).
    ValueError: If the 'length' is less than 1 or 'weight_factor' is negative.
    """
    if not isinstance(length, int):
        raise TypeError("Length must be an integer.")

    if length < 1:
        raise ValueError("Length must be a positive integer.")

    if not isinstance(weight_factor, (float, int)):
        raise TypeError("Weight factor must be a float or an integer.")

    if weight_factor < 0:
        raise ValueError("Weight factor must be a non-negative number.")

    return 1 + (length - 1) * weight_factor

In [5]:
class TestCalculateWeight(unittest.TestCase):
    def test_single_word(self):
        """Test weight calculation for a single word."""
        self.assertEqual(calculate_weight(1), 1)

    def test_pair(self):
        """Test weight calculation for a pair of words."""
        self.assertEqual(calculate_weight(2), 1.5)

    def test_triple(self):
        """Test weight calculation for a triple of words."""
        self.assertEqual(calculate_weight(3), 2)

    def test_longer_sequence(self):
        """Test weight calculation for a longer sequence."""
        self.assertEqual(calculate_weight(5), 3)

# This will run the tests
unittest.main(argv=[''], exit=False)

.......
----------------------------------------------------------------------
Ran 7 tests in 0.006s

OK


<unittest.main.TestProgram at 0x2290f39cd90>

#### Function `find_key_by_value`

- **Inputs**: A string and a dictionary.
- **Process**:
  1. **Lowercase Conversion**: Converts the input string to lowercase for uniformity.
  2. **Tokenization and Punctuation Removal**: Uses NLTK's `word_tokenize` to split the string into tokens, then keeps only the alphanumeric tokens, which drops punctuation.
  3. **Stop Words Removal**: Filters out common stop words (e.g., 'and', 'the', 'is') using NLTK's stopwords list.
  4. **Determine Maximum Sequence Length**: Calls `find_max_sequence_length` to get the longest word sequence in the dictionary values.
  5. **Generate Word Combinations**: Dynamically creates combinations of words (singles, pairs, triples, etc.) up to the maximum length.
  6. **Search and Weight Matches in Dictionary**: For each combination, checks for matches in the dictionary values. Uses `calculate_weight` to apply appropriate weighting to each match.
  7. **Count and Identify Most Common Key**: Uses a `Counter` to tally occurrences of each key, considering the weights, and identifies the most common key.
- **Return**: The most common key if matches exist; otherwise, `None`.
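
To see what the first five steps produce, the sketch below traces one sentence through them. It is an approximation under stated assumptions: a regular expression stands in for NLTK's `word_tokenize`, `STOP_WORDS` is a tiny stand-in for NLTK's English stop-word list, and the helper names `preprocess` and `word_combinations` are hypothetical.

```python
import re

# Assumption: a tiny stand-in for NLTK's English stop-word list
STOP_WORDS = {"the", "is", "to", "this", "and", "in", "are", "our", "was"}


def preprocess(text):
    """Steps 1-3: lowercase, keep alphanumeric tokens, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def word_combinations(tokens, max_length):
    """Step 5: every run of 1..max_length consecutive tokens, tagged with its length."""
    return [
        (n, " ".join(tokens[i:i + n]))
        for n in range(1, max_length + 1)
        for i in range(len(tokens) - n + 1)
    ]


tokens = preprocess("The football match was thrilling.")
print(tokens)
# ['football', 'match', 'thrilling']
print(word_combinations(tokens, 2))
# [(1, 'football'), (1, 'match'), (1, 'thrilling'), (2, 'football match'), (2, 'match thrilling')]
```

Because stop words are removed before the combinations are built, words that were separated by a stop word in the original sentence become adjacent, as 'match thrilling' shows.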

In [6]:
def find_key_by_value(string, dictionary):
    """
    Return the dictionary key whose phrases best match the input string,
    or None if nothing matches.
    """
    # Convert string to lowercase
    string = string.lower()

    # Tokenize, then keep only alphanumeric tokens (drops punctuation)
    words = nltk.word_tokenize(string)
    words = [word for word in words if word.isalnum()]

    # Remove stop words
    stop_words = set(stopwords.words("english"))
    words = [word for word in words if word not in stop_words]

    # Determine the maximum sequence length from the dictionary
    max_length = find_max_sequence_length(dictionary)

    # Generate combinations of words up to the maximum length
    elements = []
    for i in range(1, max_length + 1):
        elements += [(i, " ".join(words[j:j+i])) for j in range(len(words) - i + 1)]

    # Lowercase the dictionary phrases so entries such as 'AI' can match the lowercased text
    lowered = {key: {phrase.lower() for phrase in phrases} for key, phrases in dictionary.items()}

    # Add up fractional weights directly; truncating with int() would make
    # a two-word match (weight 1.5) count the same as a single word
    scores = Counter()
    for length, element in elements:
        for key, phrases in lowered.items():
            if element in phrases:
                scores[key] += calculate_weight(length)

    if scores:
        return scores.most_common(1)[0][0]

    return None

In [7]:
class TestFindKeyByValue(unittest.TestCase):
    def setUp(self):
        # Setup a sample dictionary for testing
        self.sample_dict = {
            'Environment': ['rainforest', 'climate change'],
            'Technology': ['AI', 'machine learning'],
            'Sports': ['football', 'teamwork']
        }

    def test_match_found(self):
        """Test a case where the match is found."""
        result = find_key_by_value("Advancements in machine learning", self.sample_dict)
        self.assertEqual(result, 'Technology')

    def test_no_match_found(self):
        """Test a case where no match is found."""
        result = find_key_by_value("This is an unmatched string", self.sample_dict)
        self.assertIsNone(result)

    def test_empty_string(self):
        """Test with an empty string."""
        result = find_key_by_value("", self.sample_dict)
        self.assertIsNone(result)

    def test_empty_dictionary(self):
        """Test with an empty dictionary."""
        result = find_key_by_value("Some string", {})
        self.assertIsNone(result)

# This will run the tests
unittest.main(argv=[''], exit=False)

...........
----------------------------------------------------------------------
Ran 11 tests in 0.050s

OK


<unittest.main.TestProgram at 0x2290f3c64c0>

This methodology enhances text processing by removing extraneous elements and considering context through word combinations. It dynamically adapts to the dictionary's complexity and prioritizes longer, more meaningful word sequences in identifying the most relevant topic or category in the text. This is especially useful in tasks like keyword extraction and categorization.

## Example: Dictionary <a id="Example-Dictionary"></a>

### Topic-Specific Keyword Dictionary

In our text analysis task, we use a specialized dictionary `my_dictionary` that contains key-value pairs for different topics. This dictionary is structured to help identify the central theme of a given text based on keyword and phrase matches. Each key in this dictionary represents a distinct topic, and the associated value is a list of keywords and key phrases relevant to that topic.

- **`Environment`**: This key represents texts related to environmental issues. The associated value is a list of keywords such as 'rainforest', 'species', and phrases like 'climate change', 'diverse species'. These terms are specifically chosen to capture the essence of environmental discussions, focusing on biodiversity, ecological issues, and climate concerns.

- **`Technology`**: Under this key, we group texts that discuss technological advancements. The keywords and phrases here include 'AI' (Artificial Intelligence), 'computing', 'machine learning', and 'advancements in'. These terms capture discussions of modern technological innovations and trends, especially in computing and AI. Note that 'advancements in' ends with the stop word 'in'; because `find_key_by_value` removes stop words before it builds word combinations, this phrase cannot match as written.

- **`Sports`**: This key is dedicated to sports-related texts. The list of keywords includes 'football', 'match', and phrases like 'teamwork', 'football match'. These are common terms in sports discussions, especially related to football, highlighting aspects of gameplay, team dynamics, and match events.

By using this dictionary, we can analyze a text and determine which topic it most likely pertains to based on the occurrence of these predefined keywords and phrases. This approach is particularly useful in categorizing texts and extracting topic-specific insights.

In [8]:
# Dictionary with key-value pairs for each topic
my_dictionary = {
    'Environment': ['rainforest', 'species', 'climate change', 'diverse species'],
    'Technology': ['AI', 'computing', 'machine learning', 'advancements in'],
    'Sports': ['football', 'match', 'teamwork', 'football match']
}
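
Since the input text is lowercased and stripped of stop words before matching, dictionary entries that contain uppercase letters or stop words may never match unless the matcher normalizes them as well. The sketch below flags such entries. `audit_dictionary` is a hypothetical helper, and `STOP_WORDS` is a small stand-in for NLTK's English stop-word list.

```python
# Assumption: a small stand-in for NLTK's English stop-word list
STOP_WORDS = {"the", "is", "in", "and", "to", "of", "a"}


def audit_dictionary(dictionary, stop_words=STOP_WORDS):
    """Return (key, phrase, reason) for entries that normalized text may not match."""
    issues = []
    for key, phrases in dictionary.items():
        for phrase in phrases:
            if phrase != phrase.lower():
                issues.append((key, phrase, "contains uppercase characters"))
            if any(word in stop_words for word in phrase.lower().split()):
                issues.append((key, phrase, "contains a stop word"))
    return issues


sample = {
    "Technology": ["AI", "computing", "machine learning", "advancements in"],
    "Sports": ["football", "match"],
}

for issue in audit_dictionary(sample):
    print(issue)
# ('Technology', 'AI', 'contains uppercase characters')
# ('Technology', 'advancements in', 'contains a stop word')
```

Running a check like this whenever the dictionary changes is one way to catch entries that silently contribute nothing to the classification.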

## Example: Individual Strings <a id="Example-Individual-Strings"></a>

### Preparing Test Strings for Topic Analysis

To demonstrate the effectiveness of our text processing and categorization approach, we prepare a set of test strings. Each string is carefully crafted to represent a specific topic, corresponding to the keys in our `my_dictionary`. These strings serve as examples to showcase how our algorithm identifies the relevant topic based on the keywords and phrases in the text.

- **`environment_string`**: This string focuses on environmental issues, specifically highlighting the rainforest's biodiversity and the threat of climate change. The sentence "The rainforest is home to diverse species. Climate change threatens this habitat." is designed to include keywords like 'rainforest', 'diverse species', and 'climate change', which are pivotal for the 'Environment' category in our dictionary.

- **`technology_string`**: Aimed at the theme of technology, this string encapsulates key elements of modern tech discourse. "Advancements in AI and computing are transforming our world. Machine learning is key." includes terms such as 'AI', 'computing', and 'machine learning', aligning it with the 'Technology' category in our dictionary.

- **`sports_string`**: This string is all about sports, with a focus on football. "The football match was thrilling. Teamwork and strategy led to victory." includes specific references to a 'football match' and general sports themes like 'teamwork', making it a perfect fit for the 'Sports' category in our dictionary.

By analyzing these strings using our `find_key_by_value` function and `my_dictionary`, we can effectively demonstrate how our text categorization system works, identifying the most relevant topic for each string based on its content.


In [9]:
# Test Strings
environment_string = "The rainforest is home to diverse species. Climate change threatens this habitat."
technology_string = "Advancements in AI and computing are transforming our world. Machine learning is key."
sports_string = "The football match was thrilling. Teamwork and strategy led to victory."

### Application of Text Categorization Function

After setting up our test strings and keyword dictionary, the next step is to apply the `find_key_by_value` function to each string. This function analyzes the content of the strings and matches them with the most relevant topic based on our predefined keyword dictionary. The process and results are as follows:

- **Applying to `environment_string`**: 
  - We pass the `environment_string` and `my_dictionary` to the `find_key_by_value` function. This string, themed around environmental issues, is analyzed to identify the most relevant topic based on the keywords it contains.
  - The result is stored in `environment_result`.

- **Applying to `technology_string`**: 
  - Similarly, the `technology_string` is processed using the same function. This string focuses on technology-related topics and is evaluated to find its best-matching category in the dictionary.
  - The outcome is captured in `technology_result`.

- **Applying to `sports_string`**: 
  - The `sports_string`, which revolves around a sports theme, particularly football, is also analyzed using the function. The aim is to determine its corresponding topic from the dictionary.
  - This result is saved in `sports_result`.

- **Printing the Results**:
  - Finally, we print out the results for each string. This showcases which topic (Environment, Technology, or Sports) has been identified as the most relevant for each respective string based on the presence of specific keywords and phrases.
  - The print statements display the outcomes in a format like "Environment String Result: [Topic]".

By executing this code, we can observe the effectiveness of our keyword-based text categorization approach, demonstrating how the algorithm matches each string with the most fitting topic from our dictionary.


In [10]:
# Applying the function to each test string
environment_result = find_key_by_value(environment_string, my_dictionary)
technology_result = find_key_by_value(technology_string, my_dictionary)
sports_result = find_key_by_value(sports_string, my_dictionary)

# Printing the results
print("Environment String Result:", environment_result)
print("Technology String Result:", technology_result)
print("Sports String Result:", sports_result)

Environment String Result: Environment
Technology String Result: Technology
Sports String Result: Sports
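
When a result looks surprising, it can help to inspect the score of every category rather than only the winner. The following is a self-contained sketch, not the notebook's function: it assumes a regular-expression tokenizer and a small `STOP_WORDS` set in place of NLTK, assumes each phrase belongs to exactly one category, and uses the hypothetical name `score_topics`.

```python
import re
from collections import Counter

# Assumption: a small stand-in for NLTK's English stop-word list
STOP_WORDS = {"the", "is", "to", "this", "and", "in", "are", "our", "was"}


def score_topics(text, dictionary, weight_factor=0.5):
    """Return a Counter mapping each matched category to its accumulated weight."""
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]
    # Assumption: every phrase belongs to a single category
    lookup = {phrase.lower(): key for key, phrases in dictionary.items() for phrase in phrases}
    max_len = max((len(phrase.split()) for phrase in lookup), default=0)

    scores = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            key = lookup.get(" ".join(tokens[i:i + n]))
            if key is not None:
                scores[key] += 1 + (n - 1) * weight_factor
    return scores


topics = {
    "Environment": ["rainforest", "species"],
    "Sports": ["football", "match", "teamwork", "football match"],
}
scores = score_topics(
    "The football match was thrilling. Teamwork and strategy led to victory.", topics
)
print(scores)  # Counter({'Sports': 4.5})
```

Here 'football', 'match', and 'teamwork' contribute 1 each and 'football match' contributes 1.5, giving 'Sports' a total of 4.5.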


## Example: Data Frame <a id="Example-Data-Frame"></a>

### Creating a DataFrame with Test Strings for Topic Analysis

To effectively demonstrate our text categorization algorithm, we first create a pandas DataFrame. This DataFrame will contain a series of test strings, each specifically crafted to represent a different topic. This setup is essential for testing our algorithm's ability to accurately categorize text based on predefined keywords.

In the code snippet below:

- **Data Dictionary**: A dictionary named `data` is created with a key `text`. The value associated with this key is a list of strings, each a sentence relevant to a specific topic (Environment, Technology, and Sports).

- **DataFrame Creation**: Using `pd.DataFrame(data)`, we convert this dictionary into a pandas DataFrame named `df`. This DataFrame now holds our test strings in a structured format, ideal for applying our text categorization function.

The DataFrame `df` will serve as the foundation for applying our `find_key_by_value` function, allowing us to test and showcase the algorithm's capability to discern and categorize the topic of each string based on its content.

In [11]:
# Create a DataFrame with test strings
data = {
    'text': [
        "The rainforest is home to diverse species. Climate change threatens this habitat.",
        "Advancements in AI and computing are transforming our world. Machine learning is key.",
        "The football match was thrilling. Teamwork and strategy led to victory."
    ]
}

df = pd.DataFrame(data)

### Applying the Categorization Function to DataFrame

After preparing our DataFrame with test strings, the next crucial step is to apply the `find_key_by_value` function to each row. This process involves analyzing the text in each row and categorizing it based on our predefined keyword dictionary. We also create a new column in the DataFrame to store these categorization results.

This code does the following:

- **Function Application**: The `apply` method is used on the 'text' column of the DataFrame `df`. For each row, the `find_key_by_value` function is called with the text as input and `my_dictionary` for keyword reference. This function determines the most relevant topic for each text string.

- **Creating a New Column**: The results of the function application (i.e., the identified topics) are stored in a new column in the DataFrame named 'topic'.

- **Displaying Results**: Finally, the updated DataFrame is displayed by evaluating `df` as the last expression in the cell. It now includes both the original text strings and the corresponding topics identified by the algorithm.

By executing this code, we demonstrate the capability of our text categorization algorithm to analyze each string and assign a relevant topic from our keyword dictionary. The result is a DataFrame that not only contains the original text but also a categorization of each text, reflecting the primary theme or topic it represents.
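
When the text column comes from real data, it may contain missing or empty cells, and a classifier that calls `.lower()` on its input would raise an error for a non-string value. The sketch below wraps the call in a guard. `safe_classify` and `classify_stub` are hypothetical names; the stub stands in for the real classifier only so that the block runs on its own.

```python
import pandas as pd


def classify_stub(text):
    """Hypothetical stand-in for the real classifier, so this block runs on its own."""
    return "Sports" if "football" in text.lower() else None


def safe_classify(value, classifier=classify_stub):
    """Return None for missing, non-string, or blank cells instead of raising."""
    if not isinstance(value, str) or not value.strip():
        return None
    return classifier(value)


frame = pd.DataFrame(
    {"text": ["The football match was thrilling.", None, float("nan"), ""]}
)
frame["topic"] = frame["text"].apply(safe_classify)
print(frame)
```

Only the first row receives a topic; the missing and empty rows are left unclassified rather than stopping the run.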

In [12]:
# Apply the function to each row and create a new column with the results
df['topic'] = df['text'].apply(lambda x: find_key_by_value(x, my_dictionary))

# Display the DataFrame
df

|   | text | topic |
|---|------|-------|
| 0 | The rainforest is home to diverse species. Climate change threatens this habitat. | Environment |
| 1 | Advancements in AI and computing are transforming our world. Machine learning is key. | Technology |
| 2 | The football match was thrilling. Teamwork and strategy led to victory. | Sports |
