## 📝 Abbreviation Expansion Pipeline

Abbreviation Expansion Pipeline is a Python class that generates n-gram pairs from product descriptions in a DataFrame and suggests abbreviation-to-expansion mappings. It uses Natural Language Processing (NLP) techniques to preprocess text, compute similarity scores, and propose replacements based on those scores.

Explore the capabilities of the Abbreviation Expansion Pipeline to streamline text analysis tasks and improve the accuracy of abbreviation expansions in your datasets.

### Overview

Abbreviation Expansion Pipeline provides a comprehensive toolkit for processing textual data, identifying n-grams, computing similarity scores between textual elements, and suggesting mappings for abbreviation expansion. It is designed to enhance text analysis workflows by facilitating efficient preprocessing, similarity assessment, and replacement suggestion tasks.

This pipeline supports various functionalities such as:

- Text preprocessing including character trimming and normalization.
- Generation of n-gram pairs from product descriptions.
- Utilization of pre-trained language models for text embedding.
- Calculation of similarity scores using metrics like fuzzy matching and sequence similarity.
- Identification and suggestion of mappings for text expansion based on similarity scores.
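
To make the n-gram step concrete, here is a minimal sketch of word-level n-gram extraction. The helper name `word_ngrams` and the whitespace tokenization are assumptions made for illustration; the pipeline's own tokenizer may behave differently.

```python
from typing import List


def word_ngrams(text: str, n: int = 2) -> List[str]:
    """Return contiguous word n-grams of a whitespace-tokenized string.

    Hypothetical helper for illustration only, not part of the pipeline's API.
    """
    tokens = text.split()
    # range() is empty when the text has fewer than n tokens
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(word_ngrams("coffee grnd decf kcup", n=2))
# ['coffee grnd', 'grnd decf', 'decf kcup']
print(word_ngrams("coffee grnd decf kcup", n=3))
# ['coffee grnd decf', 'grnd decf kcup']
```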

---

### ✨ Key Features

- **N-gram Generation**: Extracts n-gram pairs from product descriptions to identify potential abbreviation expansions.
- **Text Preprocessing**: Normalizes text by trimming characters and replacing specified patterns.
- **Similarity Assessment**: Computes similarity scores using cosine similarity, fuzzy matching, and sequence similarity.
- **Mapping Suggestions**: Suggests mappings for text expansion based on the highest similarity scores between textual elements.
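
The score types named above can be sketched with the standard library alone. In this sketch, `difflib.SequenceMatcher` stands in for sequence similarity and a hand-written function shows cosine similarity between two vectors. The function names and the toy vectors are illustrative assumptions; the pipeline itself relies on a Hugging Face model and `fuzzywuzzy`, and its exact scoring may differ.

```python
import math
from difflib import SequenceMatcher
from typing import Sequence


def sequence_similarity(a: str, b: str) -> float:
    """Ratio in [0, 1] based on matching character blocks (stdlib difflib)."""
    return SequenceMatcher(None, a, b).ratio()


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Character-level similarity between an abbreviation and its candidate expansion
print(sequence_similarity("wht", "white"))

# Toy vectors standing in for text embeddings (assumed values, not model output)
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

A candidate pair would then be kept only when its score clears a threshold, in the spirit of the `cosine_threshold` and `min_text_match_threshold` parameters shown in the usage section.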

---

### 📦 Requirements

- Python 3.x
- pandas
- transformers
- fuzzywuzzy
- nltk
- numpy
- pandarallel (optional for parallel processing)
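
Before running the pipeline, it can help to confirm that the listed packages are importable. The sketch below uses `importlib.util.find_spec` from the standard library; the helper name `missing_packages` and the `REQUIRED`/`OPTIONAL` lists are assumptions that simply mirror the list above.

```python
from importlib.util import find_spec
from typing import Iterable, List

# Top-level import names, mirroring the requirements list (assumed to match)
REQUIRED = ["pandas", "transformers", "fuzzywuzzy", "nltk", "numpy"]
OPTIONAL = ["pandarallel"]


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the names that cannot be found on the current Python path."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing required packages:", ", ".join(missing))
else:
    print("All required packages are installed.")

missing_optional = missing_packages(OPTIONAL)
if missing_optional:
    print("Optional packages not found:", ", ".join(missing_optional))
```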

---

### 🚀 Usage

#### Using `abbreviation_expansion_pipeline.py`

```python
import pandas as pd
from abbreviation_expansion_pipeline import AbbreviationExpansionPipeline

# Load product descriptions; this example expects them in a "DESC" column
df: pd.DataFrame = pd.read_csv('product_descriptions.csv')

# Create an instance of AbbreviationExpansionPipeline
pipeline = AbbreviationExpansionPipeline(
    dataframe_object=df,
    product_desc_column='DESC',
    ngram=2,
    output_file_name='MINED_KEYWORD_MAPPING',
    hugging_face_model_name='google-bert/bert-base-uncased',
    max_text_length=256,
    cosine_threshold=0.75,
    min_text_match_threshold=85.0,
)

# Run the pipeline
pipeline.main()
```

---

```python
import pandas as pd
from abbreviation_expansion_pipeline import AbbreviationExpansionPipeline

# Sample DataFrame
df:pd.DataFrame = pd.DataFrame(data={
    'PROD_DESC': [
      'drink - mix frsh',
      'drink_mix fresh',
      'wine white sparkling brut',
      'wine wht sparkling brut',
      'coffee grnd decf kcup',
      'coffee ground decaf kcup',
    ],
  }
)

# Create an instance of AbbreviationExpansionPipeline and run it
AbbreviationExpansionPipeline(
    dataframe_object=df,
    product_desc_column='PROD_DESC',
    ngram=2,
    output_file_name='BI_GRAM_KEYWORDS_MINING',
    hugging_face_model_name='google-bert/bert-base-uncased',
    max_text_length=256,
    cosine_threshold=0.75,
    min_text_match_threshold=85.0,
  ).main()
```
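
The sample descriptions above mix separators such as ` - ` and `_`, while the tri-gram results later in this document show them as single spaces (`drink mix frsh`, `drink mix fresh`). A minimal sketch of that kind of normalization follows; the function name and the exact regular expression are assumptions for illustration, not the pipeline's actual preprocessing rules.

```python
import re


def normalize_description(text: str) -> str:
    """Lowercase, collapse every run of non-alphanumeric characters to one space, and trim.

    Hypothetical helper; the real pipeline may apply different rules.
    """
    text = text.lower()
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return text.strip()


print(normalize_description("drink - mix frsh"))  # drink mix frsh
print(normalize_description("drink_mix fresh"))   # drink mix fresh
```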

In [1]:
!git clone https://github.com/sherozshaikh/abbreviation_expansion_pipeline.git

Cloning into 'abbreviation_expansion_pipeline'...
remote: Enumerating objects: 52, done.
remote: Counting objects: 100% (52/52), done.
remote: Compressing objects: 100% (52/52), done.
remote: Total 52 (delta 21), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (52/52), 59.33 KiB | 2.20 MiB/s, done.
Resolving deltas: 100% (21/21), done.


In [2]:
!mv './abbreviation_expansion_pipeline/abbreviation_expansion_pipeline.py' './'

In [3]:
!rm -rf abbreviation_expansion_pipeline

In [4]:
!ls -lsh

total 28K
 24K -rw-r--r-- 1 root root  22K Jul 15 23:52 abbreviation_expansion_pipeline.py
4.0K drwxr-xr-x 1 root root 4.0K Jul 11 13:22 sample_data


In [5]:
import pandas as pd
from abbreviation_expansion_pipeline import AbbreviationExpansionPipeline

All required packages are installed.


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


INFO: Pandarallel will run on 1 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
All required packages are imported.


In [6]:
def custom_ram_cleanup_func()->None:
  """
  Delete user-defined global variables to free memory.

  Every global name is deleted unless it starts with an underscore ('_'),
  matches a key in `sys.modules`, or appears in the explicit exclusion list
  (IPython internals, 'sys', 'os', and this function itself).

  Note: import aliases such as `pd` are not keys in `sys.modules`, so they
  are deleted as well and must be re-imported afterwards.

  Returns:
  None
  """
  import sys
  exclude_vars = set(sys.modules.keys())
  exclude_vars.update(['In','Out','_','__','___','__builtin__','__builtins__','__doc__','__loader__','__name__','__package__','__spec__','_dh','_i','_i1','_ih','_ii','_iii','_oh','exit','get_ipython','quit','sys','os','custom_ram_cleanup_func',])
  # Iterate over a snapshot of the names, because globals() changes during deletion
  for var in list(globals().keys()):
    if var not in exclude_vars and not var.startswith('_'):
      del globals()[var]
  return None


In [7]:
# Sample DataFrame
df:pd.DataFrame = pd.DataFrame(data={
    'PROD_DESC': [
      'drink - mix frsh',
      'drink_mix fresh',
      'wine white sparkling brut',
      'wine wht sparkling brut',
      'coffee grnd decf kcup',
      'coffee ground decaf kcup',
    ],
  }
)

In [8]:
# Create an instance of AbbreviationExpansionPipeline and run it with bi-grams (ngram=2)
AbbreviationExpansionPipeline(
    dataframe_object=df,
    product_desc_column='PROD_DESC',
    ngram=2,
    output_file_name='BI_GRAM_KEYWORDS_MINING',
    hugging_face_model_name='google-bert/bert-base-uncased',
    max_text_length=256,
    cosine_threshold=0.75,
    min_text_match_threshold=85.0,
  ).main()

Elapsed time: 0.00 minutes
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┃    Nearest Matches Found: 13    ┃
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┃   Potential Matches Found: 5    ┃
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Elapsed time: 0.00 minutes

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┃ Abbreviation Expansion with Examples: 3 ┃
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┃ Unique Abbreviation Expansion Found: 3 ┃
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Elapsed time: 0.34 minutes


In [9]:
!free -h

               total        used        free      shared  buff/cache   available
Mem:            12Gi       1.5Gi       4.3Gi       420Mi       6.9Gi        10Gi
Swap:             0B          0B          0B


In [10]:
!ls -lsh

total 108K
 24K -rw-r--r-- 1 root root  22K Jul 15 23:52 abbreviation_expansion_pipeline.py
4.0K -rw-r--r-- 1 root root  187 Jul 15 23:54 BI_GRAM_KEYWORDS_MINING_Examples.csv
4.0K -rw-r--r-- 1 root root  113 Jul 15 23:54 BI_GRAM_KEYWORDS_MINING_Mapping.csv
 20K -rw-r--r-- 1 root root  19K Jul 15 23:53 doc_mapper.py
 20K -rw-r--r-- 1 root root  18K Jul 15 23:53 embedding.py
4.0K drwxr-xr-x 2 root root 4.0K Jul 15 23:53 __pycache__
4.0K drwxr-xr-x 1 root root 4.0K Jul 11 13:22 sample_data
 28K -rw-r--r-- 1 root root  27K Jul 15 23:53 text_scoring.py


In [11]:
pd.read_csv(filepath_or_buffer='BI_GRAM_KEYWORDS_MINING_Examples.csv',dtype='str',encoding='latin-1')

| Unnamed: 0 | PROD_DESC1 | PROD_DESC1.1 | SIMILARITY_SCORE | SUGGESTED_MAPPING |
| --- | --- | --- | --- | --- |
| 0 | decaf kcup | decf kcup | 84 | {'decf': 'decaf'} |
| 1 | coffee grnd | coffee ground | 81 | {'grnd': 'ground'} |
| 2 | wine white | wine wht | 75 | {'wht': 'white'} |
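
Because the file is read with `dtype='str'`, each `SUGGESTED_MAPPING` cell arrives as a string that looks like a Python dict. One way to turn it back into a real dict is `ast.literal_eval`, sketched below. The helper name `parse_mapping` is an assumption, and the sketch assumes every cell is a valid dict literal like the ones shown above.

```python
import ast
from typing import Dict


def parse_mapping(cell: str) -> Dict[str, str]:
    """Parse a dict literal such as "{'decf': 'decaf'}" into a dict of strings.

    Hypothetical helper; raises ValueError if the cell is not a dict literal.
    """
    value = ast.literal_eval(cell)
    if not isinstance(value, dict):
        raise ValueError(f"Expected a dict literal, got: {cell!r}")
    return {str(key): str(val) for key, val in value.items()}


print(parse_mapping("{'decf': 'decaf'}"))
# {'decf': 'decaf'}
print(parse_mapping("{'grnd': 'ground', 'decf': 'decaf'}"))
# {'grnd': 'ground', 'decf': 'decaf'}
```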


In [12]:
pd.read_csv(filepath_or_buffer='BI_GRAM_KEYWORDS_MINING_Mapping.csv',dtype='str',encoding='latin-1')

| Unnamed: 0 | TO_REPLACE | REPALCE_WITH | SIMILARITY_SCORE_1 | SIMILARITY_SCORE_2 |
| --- | --- | --- | --- | --- |
| 0 | decf | decaf | 75 | 95 |
| 1 | grnd | ground | 50 | 91 |
| 2 | wht | white | 67 | 89 |
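
Once reviewed, a mapping like the one above can be applied back to the product descriptions. The following is a minimal sketch of whole-word replacement with the standard `re` module. The function name `expand_abbreviations` is an assumption, and the dictionary is typed in by hand from the `TO_REPLACE` and `REPALCE_WITH` columns (the column name is spelled that way in the output file).

```python
import re
from typing import Dict


def expand_abbreviations(text: str, mapping: Dict[str, str]) -> str:
    """Replace whole-word abbreviations in `text` using `mapping`.

    Hypothetical helper, not part of the pipeline's API.
    """
    if not mapping:
        return text
    # Longest keys first so a longer abbreviation wins over a shorter prefix
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda match: mapping[match.group(1)], text)


# Values copied from the mapping output above
mapping = {"decf": "decaf", "grnd": "ground", "wht": "white"}

print(expand_abbreviations("coffee grnd decf kcup", mapping))
# coffee ground decaf kcup
print(expand_abbreviations("wine wht sparkling brut", mapping))
# wine white sparkling brut
```

The word-boundary anchors keep already expanded words such as `white` untouched.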


In [13]:
# Create an instance of AbbreviationExpansionPipeline and run it with tri-grams (ngram=3)
AbbreviationExpansionPipeline(
    dataframe_object=df,
    product_desc_column='PROD_DESC',
    ngram=3,
    output_file_name='TRI_GRAM_KEYWORDS_MINING',
    hugging_face_model_name='google-bert/bert-base-uncased',
    max_text_length=256,
    cosine_threshold=0.75,
    min_text_match_threshold=85.0,
  ).main()

Elapsed time: 0.00 minutes
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┃    Nearest Matches Found: 10    ┃
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┃   Potential Matches Found: 8    ┃
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Elapsed time: 0.00 minutes

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┃ Abbreviation Expansion with Examples: 5 ┃
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┃ Unique Abbreviation Expansion Found: 4 ┃
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Elapsed time: 0.33 minutes


In [14]:
pd.read_csv(filepath_or_buffer='TRI_GRAM_KEYWORDS_MINING_Examples.csv',dtype='str',encoding='latin-1')

| Unnamed: 0 | PROD_DESC1 | PROD_DESC1.1 | SIMILARITY_SCORE | SUGGESTED_MAPPING |
| --- | --- | --- | --- | --- |
| 0 | drink mix fresh | drink mix frsh | 86 | {'frsh': 'fresh'} |
| 1 | white sparkling brut | wht sparkling brut | 81 | {'wht': 'white'} |
| 2 | wine white sparkling | wine wht sparkling | 79 | {'wht': 'white'} |
| 3 | coffee grnd decf | coffee ground decaf | 78 | {'grnd': 'ground', 'decf': 'decaf'} |
| 4 | grnd decf kcup | ground decaf kcup | 78 | {'grnd': 'ground', 'decf': 'decaf'} |


In [15]:
pd.read_csv(filepath_or_buffer='TRI_GRAM_KEYWORDS_MINING_Mapping.csv',dtype='str',encoding='latin-1')

| Unnamed: 0 | TO_REPLACE | REPALCE_WITH | SIMILARITY_SCORE_1 | SIMILARITY_SCORE_2 |
| --- | --- | --- | --- | --- |
| 0 | decf | decaf | 75 | 95 |
| 1 | frsh | fresh | 75 | 95 |
| 2 | grnd | ground | 50 | 91 |
| 3 | wht | white | 67 | 89 |
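
In this run the tri-gram mapping contains `frsh`, which the bi-gram mapping did not. If mappings from several n-gram sizes are to be combined, one simple approach is to merge them and flag any abbreviation that received different expansions. The sketch below is an assumption about how that could be done, not something the pipeline does itself; the dictionaries are typed in by hand from the two mapping outputs in this document.

```python
from typing import Dict, Set, Tuple


def merge_mappings(*runs: Dict[str, str]) -> Tuple[Dict[str, str], Dict[str, Set[str]]]:
    """Merge mappings from several runs; the first expansion seen for a key is kept.

    Returns (merged, conflicts), where conflicts lists every abbreviation
    that was given more than one distinct expansion.
    """
    merged: Dict[str, str] = {}
    conflicts: Dict[str, Set[str]] = {}
    for run in runs:
        for abbreviation, expansion in run.items():
            previous = merged.setdefault(abbreviation, expansion)
            if previous != expansion:
                conflicts.setdefault(abbreviation, {previous}).add(expansion)
    return merged, conflicts


# Values copied from the bi-gram and tri-gram mapping outputs
bigram = {"decf": "decaf", "grnd": "ground", "wht": "white"}
trigram = {"decf": "decaf", "frsh": "fresh", "grnd": "ground", "wht": "white"}

merged, conflicts = merge_mappings(bigram, trigram)
print(sorted(merged))  # ['decf', 'frsh', 'grnd', 'wht']
print(conflicts)       # {}
```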


In [16]:
!free -h

               total        used        free      shared  buff/cache   available
Mem:            12Gi       1.5Gi       4.7Gi       2.0Mi       6.5Gi        10Gi
Swap:             0B          0B          0B


In [17]:
!ls -lsh

total 116K
 24K -rw-r--r-- 1 root root  22K Jul 15 23:52 abbreviation_expansion_pipeline.py
4.0K -rw-r--r-- 1 root root  187 Jul 15 23:54 BI_GRAM_KEYWORDS_MINING_Examples.csv
4.0K -rw-r--r-- 1 root root  113 Jul 15 23:54 BI_GRAM_KEYWORDS_MINING_Mapping.csv
 20K -rw-r--r-- 1 root root  19K Jul 15 23:53 doc_mapper.py
 20K -rw-r--r-- 1 root root  18K Jul 15 23:53 embedding.py
4.0K drwxr-xr-x 2 root root 4.0K Jul 15 23:53 __pycache__
4.0K drwxr-xr-x 1 root root 4.0K Jul 11 13:22 sample_data
 28K -rw-r--r-- 1 root root  27K Jul 15 23:53 text_scoring.py
4.0K -rw-r--r-- 1 root root  381 Jul 15 23:54 TRI_GRAM_KEYWORDS_MINING_Examples.csv
4.0K -rw-r--r-- 1 root root  130 Jul 15 23:54 TRI_GRAM_KEYWORDS_MINING_Mapping.csv


In [18]:
custom_ram_cleanup_func()
del custom_ram_cleanup_func