# Lyrics Text Normalization

This script reads an Excel file containing lyrics data, normalizes the text in the lyrics, and saves the modified data back to an Excel file.

## Steps

1. **Read the Excel file**: The pandas library is used to read the Excel file containing the lyrics data. The file path is specified in the code.

2. **Normalize text**: A function `normalize_text` is defined to normalize the lyrics. It converts the text to lowercase, collapses runs of whitespace into single spaces, and drops every character that is not a letter, number, or separator, except for a small set of punctuation marks (apostrophes and hyphens). It uses the `unicodedata` library to look up each character's Unicode category.

3. **Apply the normalization function**: The `normalize_text` function is applied to the 'Lyrics' column of the DataFrame. The normalized texts are stored in a new column 'NormalizedText'.

4. **Save the modified DataFrame**: The modified DataFrame, which now includes the normalized text, is saved to an Excel file. The output file path is specified in the code.
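To make the effect of step 2 concrete, here is a minimal sketch that applies the same rules to a single string. The helper name `normalize_one` and the sample sentence are made up for illustration; the filtering logic mirrors `normalize_text` in the cell below.

```python
import unicodedata

# Same punctuation whitelist as in the script (\u2019 is the typographic apostrophe)
KEEP_PUNCTUATION = {"'", "-", "\u2019"}

def normalize_one(text):
    """Normalize a single string (hypothetical helper mirroring normalize_text)."""
    # Lowercase and collapse any run of whitespace into one space
    text = " ".join(str(text).lower().split())
    # Keep letters (L*), numbers (N*), separators (Z*), and whitelisted punctuation
    return "".join(
        ch for ch in text
        if unicodedata.category(ch)[0] in ("L", "N", "Z") or ch in KEEP_PUNCTUATION
    )

# \u00bf is the inverted question mark, \u00e9 is e with acute accent
sample = "Hello,   World! It's 2024 -- \u00bfQu\u00e9 tal?"
print(normalize_one(sample))
```

Commas, exclamation marks, and question marks are removed, while the apostrophe, the hyphens, and the accented letter survive.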


In [1]:
import pandas as pd
import unicodedata

# Read the Excel file
df = pd.read_excel('../Chara-Based-Model/DataSets/Lyrics Training Data GPT Generated.xlsx', engine='openpyxl')

# Normalize text
def normalize_text(texts):
    normalized_texts = []
    keep_punctuation = {"'", "-", "’"}  # Add or remove characters as needed
    for text in texts:
        try:
            text = str(text).lower()  # Convert to lowercase and ensure text is string
            text = ' '.join(text.split())  # Normalize whitespace
            # Iterate over each character in the text
            # and include it in the output if it meets certain conditions
            text = ''.join(
                char for char in text 
                if unicodedata.category(char)[0] in ('L', 'N', 'Z')  # Check if the character is a letter, number, or space
                or char in keep_punctuation  # Or if the character is in the custom set of punctuation marks to keep
            )

            normalized_texts.append(text)
        except Exception as e:
            print(f"Error processing text: {text} with error {e}")
            normalized_texts.append(text)  # Append original text or handle as needed
    return normalized_texts


# Normalize text
normalized_texts = normalize_text(df['Lyrics'].astype(str))

# Add the normalized text to the dataframe
df['NormalizedText'] = normalized_texts

# Save the modified DataFrame
output_path = 'normalized_text.xlsx'  # Replace with a path that is not synced
df.to_excel(output_path, engine='openpyxl', index=False)
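One thing worth checking with this filter: it keeps only the Unicode major categories `L`, `N`, and `Z`, so combining marks (category `M`) are dropped. For scripts that rely on them, such as Devanagari vowel signs, this changes the words. The sketch below illustrates the difference with a hypothetical helper `strip_to_letters`; whether marks should be kept depends on the data and is an assumption to verify, not something the script decides.

```python
import unicodedata

def strip_to_letters(text, keep_marks=False):
    """Filter characters by Unicode major category (hypothetical helper)."""
    allowed = ("L", "N", "Z", "M") if keep_marks else ("L", "N", "Z")
    return "".join(ch for ch in text if unicodedata.category(ch)[0] in allowed)

# Devanagari letter KA followed by the dependent vowel sign I
word = "\u0915\u093f"

# Show the category of each code point
print([unicodedata.category(ch) for ch in word])

# Compare the lengths with and without combining marks
print(len(strip_to_letters(word)), len(strip_to_letters(word, keep_marks=True)))
```

If the marks matter for the model, adding `'M'` to the tuple of accepted categories in `normalize_text` would be one way to retain them.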


# Text Processing and Label Binarization in Python

This script reads an Excel file containing normalized text data, processes the text, converts language labels into a binary matrix, and writes the processed data back to a new Excel file.

## Libraries Used
- pandas: For data manipulation and analysis.
- sklearn.preprocessing: Provides the utility class MultiLabelBinarizer for transforming sets of labels (several labels per sample) into a binary indicator matrix.

## Steps
1. The required libraries are imported.
2. The normalized Excel file is read into a pandas DataFrame.
3. The language column of the DataFrame (spelled 'Langauges' in the dataset and in the code) is processed to convert the comma-separated string of languages into a list of languages.
4. The MultiLabelBinarizer is initialized.
5. The language labels are converted into a binary matrix.
6. A new DataFrame is created for the binary matrix with appropriate column names.
7. The original DataFrame is concatenated with the new binary matrix DataFrame.
8. The resulting DataFrame is written to a new Excel file.

Please ensure that the required libraries are installed in your Python environment before running this script.
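The idea behind steps 3 to 5 can be sketched in plain Python. This is not scikit-learn's implementation, only an illustration of the transformation on made-up rows; `split_labels` and `binarize` are hypothetical helper names.

```python
def split_labels(raw):
    """Turn 'EN, FR' into ['EN', 'FR'], ignoring empty pieces."""
    return [lang.strip() for lang in raw.split(",") if lang.strip()]

def binarize(label_lists):
    """Return the sorted class names and one 0/1 row per sample."""
    classes = sorted({label for labels in label_lists for label in labels})
    matrix = [[1 if c in labels else 0 for c in classes] for labels in label_lists]
    return classes, matrix

# Made-up rows in the same comma-separated format as the dataset column
rows = ["EN, FR", "AR,EN,FR", "EN,KO"]
classes, matrix = binarize([split_labels(r) for r in rows])

print(classes)
for row in matrix:
    print(row)
```

Each column corresponds to one language and each row marks which languages appear in that sample, which is the layout the script then concatenates onto the original DataFrame.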


In [2]:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Step 1: Load the dataset
df = pd.read_excel("./normalized_text.xlsx")  # Update the path to where you've stored the file

# Step 2: Process the 'Langauges' column (spelled this way in the dataset) to handle multiple labels
# Split each string on commas and strip whitespace to get a clean list of languages per entry
df['Langauges'] = df['Langauges'].apply(lambda x: [lang.strip() for lang in x.split(',') if lang.strip()])

# Step 3: Initialize MultiLabelBinarizer
mlb = MultiLabelBinarizer()

# Step 4: Convert the language labels into a binary matrix (one column per language)
binary_matrix = mlb.fit_transform(df['Langauges'])

# Step 5: Wrap the binary matrix in a DataFrame, using the learned classes as column names
binary_matrix_df_corrected = pd.DataFrame(binary_matrix, columns=mlb.classes_)

# Inspect the DataFrame (by default a notebook displays only the last expression of a cell, so this line shows nothing here)
binary_matrix_df_corrected.head()


# Merge the binary matrix with the original DataFrame
df_combined = pd.concat([df, binary_matrix_df_corrected], axis=1)

# Save the combined DataFrame to a new Excel file
output_file_path = "combined_normalized_text.xlsx"  # Specify your desired output file path
df_combined.to_excel(output_file_path, index=False)

print(f"Combined DataFrame saved to {output_file_path}")


Combined DataFrame saved to combined_normalized_text.xlsx


In [3]:
print(df['Langauges'].apply(lambda x: ','.join(sorted(x))).unique())


['AR,EN,FR' 'DE,ES,KO' 'EN,HI,RU' 'EN,ES,PT' 'EN,FR,IT' 'EN,FR' 'AR,FR'
 'EN,KO' 'ES,IT' 'EN,SW' 'DE,EN' 'AR,EN' 'EN,ES' 'EN,IS' 'ES,PT']


In [4]:
import pandas as pd
import json

# Load your dataset
df = pd.read_excel('combined_normalized_text.xlsx', engine='openpyxl')
normalized_texts = df['NormalizedText'].tolist()

# Concatenate all normalized texts into one large string to find unique characters
all_text = ''.join(normalized_texts)

# Identify and sort unique characters
unique_chars = sorted(set(all_text))

# Create a mapping from unique characters to indices, adding an 'UNK' (unknown) token for unseen characters
char_to_index = {'UNK': 0}  # Start with 'UNK' token mapped to 0
char_to_index.update({char: index + 1 for index, char in enumerate(unique_chars)})  # Shift indices by 1

# Create a reverse mapping from indices to characters
index_to_char = {index: char for char, index in char_to_index.items()}

# Display the number of unique characters, including the 'UNK' token
print(f"Total unique characters (including 'UNK'): {len(char_to_index)}")

# Optional: Save the mappings to JSON files for future use
with open('char_to_index.json', 'w') as f:
    json.dump(char_to_index, f)

with open('index_to_char.json', 'w') as f:
    json.dump(index_to_char, f)

# The character vocabulary is now created and saved. This includes handling of unseen characters via 'UNK' token.


Total unique characters (including 'UNK'): 660
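As a hedged illustration of how such a vocabulary might be used downstream, the sketch below builds the same kind of mapping from two made-up strings and encodes text with a fallback to the `'UNK'` index. The `encode` helper is hypothetical and not part of the notebook.

```python
# Made-up stand-ins for the normalized lyrics
texts = ["la la", "na-na"]

# Build the vocabulary the same way as the cell above
unique_chars = sorted(set("".join(texts)))
char_to_index = {"UNK": 0}
char_to_index.update({ch: i + 1 for i, ch in enumerate(unique_chars)})

def encode(text, mapping):
    """Map each character to its index, using the 'UNK' index for unseen characters."""
    return [mapping.get(ch, mapping["UNK"]) for ch in text]

print(char_to_index)
print(encode("la-z", char_to_index))  # 'z' is not in the vocabulary
```

Because every real key is a single character, the three-character key `'UNK'` cannot collide with a character from the text.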


In [5]:
import pandas as pd
import json

try:
    # Load the normalized text data
    df = pd.read_excel('normalized_text.xlsx', engine='openpyxl')
except FileNotFoundError:
    # Report the problem and stop: without the DataFrame the rest of the cell cannot run
    print("The specified file does not exist.")
    raise

normalized_texts = df['NormalizedText'].tolist()  # Assuming the column name is 'NormalizedText'

# Concatenate all normalized texts into one large string
all_text = ''.join(normalized_texts)

# Identify and sort unique characters
unique_chars = sorted(set(all_text))

# Create a mapping from unique characters to indices
char_to_index = {char: index for index, char in enumerate(unique_chars, start=1)}  # Start indexing from 1

# Optional: Create a reverse mapping from indices to characters
index_to_char = {index: char for char, index in char_to_index.items()}

# Display the mappings
print("Character to Index Mapping:")
print(char_to_index)
print("\nIndex to Character Mapping:")
print(index_to_char)

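# Save the mappings for future use
# Note: this overwrites the files written by the previous cell, whose mapping included the 'UNK' token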
with open('char_to_index.json', 'w') as f:
    json.dump(char_to_index, f)
with open('index_to_char.json', 'w') as f:
    json.dump(index_to_char, f)


Character to Index Mapping:
{' ': 1, "'": 2, '-': 3, '0': 4, '1': 5, '2': 6, '3': 7, '6': 8, '9': 9, 'a': 10, 'b': 11, 'c': 12, 'd': 13, 'e': 14, 'f': 15, 'g': 16, 'h': 17, 'i': 18, 'j': 19, 'k': 20, 'l': 21, 'm': 22, 'n': 23, 'o': 24, 'p': 25, 'q': 26, 'r': 27, 's': 28, 't': 29, 'u': 30, 'v': 31, 'w': 32, 'x': 33, 'y': 34, 'z': 35, 'ß': 36, 'à': 37, 'á': 38, 'â': 39, 'ã': 40, 'ä': 41, 'æ': 42, 'ç': 43, 'è': 44, 'é': 45, 'ê': 46, 'í': 47, 'î': 48, 'ï': 49, 'ñ': 50, 'ó': 51, 'ô': 52, 'ö': 53, 'ù': 54, 'ú': 55, 'û': 56, 'ü': 57, 'œ': 58, 'а': 59, 'б': 60, 'в': 61, 'г': 62, 'д': 63, 'е': 64, 'ж': 65, 'з': 66, 'и': 67, 'й': 68, 'к': 69, 'л': 70, 'м': 71, 'н': 72, 'о': 73, 'п': 74, 'р': 75, 'с': 76, 'т': 77, 'у': 78, 'х': 79, 'ц': 80, 'ч': 81, 'ш': 82, 'щ': 83, 'ъ': 84, 'ы': 85, 'ь': 86, 'ю': 87, 'я': 88, 'ё': 89, 'ء': 90, 'آ': 91, 'أ': 92, 'إ': 93, 'ئ': 94, 'ا': 95, 'ب': 96, 'ة': 97, 'ت': 98, 'ث': 99, 'ج': 100, 'ح': 101, 'خ': 102, 'د': 103, 'ذ': 104, 'ر': 105, 'ز': 106, 'س': 107, 'ش': 108,
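A detail to keep in mind when reloading `index_to_char.json`: JSON object keys are always strings, so the integer indices come back as `'1'`, `'2'`, and so on. The sketch below shows the round trip on a tiny made-up mapping and one way to restore integer keys.

```python
import json

# Tiny made-up mapping with integer keys, like index_to_char
index_to_char = {0: "UNK", 1: " ", 2: "a"}

# Serialize and parse again, as json.dump / json.load would do via a file
serialized = json.dumps(index_to_char)
loaded = json.loads(serialized)

# The keys are now strings; convert them back to integers before use
restored = {int(key): value for key, value in loaded.items()}

print(list(loaded.keys()))
print(restored == index_to_char)
```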


In [None]:
import sys
print(sys.executable)
# Use this executable path to install TensorFlow
!{sys.executable} -m pip install --upgrade tensorflow


/bin/python3


Defaulting to user installation because normal site-packages is not writeable


In [None]:
!{sys.executable} -m pip install --upgrade --force-reinstall tensorflow

Defaulting to user installation because normal site-packages is not writeable
Collecting tensorflow
  Using cached tensorflow-2.15.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (475.2 MB)
Collecting wrapt<1.15,>=1.11.0
  Using cached wrapt-1.14.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (77 kB)
Collecting libclang>=13.0.0
  Using cached libclang-16.0.6-py2.py3-none-manylinux2010_x86_64.whl (22.9 MB)
Collecting keras<2.16,>=2.15.0
  Using cached keras-2.15.0-py3-none-any.whl (1.7 MB)
Collecting google-pasta>=0.1.1
  Using cached google_pasta-0.2.0-py3-none-any.whl (57 kB)
Collecting setuptools
  Using cached setuptools-69.1.0-py3-none-any.whl (819 kB)
Collecting grpcio<2.0,>=1.24.3
  Using cached grpcio-1.60.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.4 MB)
Collecting gast!=0.5.0,!=0.5.1,!=0.5.2,>=0.2.1
  Using cached gast-0.5.4-py3-none-any.whl (19 kB)
Collecting termcolor>=1.1.0
  Using c