# Lyrics Text Normalization

This script reads an Excel file containing lyrics data, normalizes the text in the lyrics, and saves the modified data back to an Excel file.

## Steps

1. **Read the Excel file**: The pandas library is used to read the Excel file containing the lyrics data. The file path is specified in the code.

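2. **Normalize text**: A function `normalize_text` is defined to normalize the text in the lyrics. It converts the text to lowercase, collapses runs of whitespace into single spaces, and drops every character that is not a letter, number, or separator, except for a small set of punctuation marks (apostrophes and hyphens). It uses the `unicodedata` library to look up each character's Unicode general category.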

3. **Apply the normalization function**: The `normalize_text` function is applied to the 'Lyrics' column of the DataFrame. The normalized texts are stored in a new column 'NormalizedText'.

4. **Save the modified DataFrame**: The modified DataFrame, which now includes the normalized text, is saved to an Excel file. The output file path is specified in the code.
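
As a quick illustration of step 2, the sketch below applies the same rules to a single string. `normalize_one`, `KEEP_PUNCTUATION`, and the sample sentence are hypothetical names introduced only for this demonstration; the set of kept punctuation is assumed to mirror the one used in the script.

```python
import unicodedata

# Punctuation to keep (assumed to match the script's custom set)
KEEP_PUNCTUATION = {"'", "-", "’"}

def normalize_one(text):
    """Hypothetical single-string version of the normalization rules."""
    text = str(text).lower()        # lowercase, and make sure we have a string
    text = " ".join(text.split())   # collapse runs of whitespace into one space
    return "".join(
        ch for ch in text
        if unicodedata.category(ch)[0] in ("L", "N", "Z")  # letter, number, separator
        or ch in KEEP_PUNCTUATION                          # or explicitly kept punctuation
    )

sample = "Hello,   World! It's a well-known  song… (2024)"
print(normalize_one(sample))
# hello world it's a well-known song 2024
```

Note that punctuation is removed after whitespace is collapsed, so a punctuation mark surrounded by spaces (for example `a , b`) leaves two consecutive spaces behind. A second whitespace pass would tidy that up if it matters for the downstream model.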


In [1]:
import pandas as pd
import unicodedata

# Read the Excel file
df = pd.read_excel('../Chara-Based-Model/DataSets/Lyrics Training Data GPT Generated.xlsx', engine='openpyxl')

# Normalize text
def normalize_text(texts):
    normalized_texts = []
    keep_punctuation = {"'", "-", "’"}  # Add or remove characters as needed
    for text in texts:
        try:
            text = str(text).lower()  # Convert to lowercase and ensure text is string
            text = ' '.join(text.split())  # Normalize whitespace
            # Iterate over each character in the text
            # and include it in the output if it meets certain conditions
            text = ''.join(
                char for char in text 
                if unicodedata.category(char)[0] in ('L', 'N', 'Z')  # Check if the character is a letter, number, or space
                or char in keep_punctuation  # Or if the character is in the custom set of punctuation marks to keep
            )

            normalized_texts.append(text)
        except Exception as e:
            print(f"Error processing text: {text} with error {e}")
            normalized_texts.append(text)  # Append original text or handle as needed
    return normalized_texts


# Apply the normalization to the 'Lyrics' column
normalized_texts = normalize_text(df['Lyrics'].astype(str))

# Add the normalized text to the dataframe
df['NormalizedText'] = normalized_texts

# Save the modified DataFrame
output_path = 'normalized_text.xlsx'  # Replace with a path that is not synced
df.to_excel(output_path, engine='openpyxl', index=False)


# Text Processing and Label Binarization in Python

This script reads an Excel file containing normalized text data, processes the text, converts language labels into a binary matrix, and writes the processed data back to a new Excel file.

## Libraries Used
- pandas: For data manipulation and analysis.
- sklearn.preprocessing: Provides the utility class MultiLabelBinarizer, which transforms multi-label data (a list of labels per row) into a binary indicator matrix.

## Steps
1. The required libraries are imported.
2. The normalized Excel file is read into a pandas DataFrame.
3. The 'Langauges' column of the DataFrame (the header is spelled this way in the dataset, and the code uses the same spelling) is processed to turn each comma-separated string of languages into a list of languages.
4. The MultiLabelBinarizer is initialized.
5. The language labels are converted into a binary matrix.
6. A new DataFrame is created for the binary matrix with appropriate column names.
7. The original DataFrame is concatenated with the new binary matrix DataFrame.
8. The resulting DataFrame is written to a new Excel file.

Before running this script, make sure pandas, scikit-learn, and openpyxl (used by pandas to read and write `.xlsx` files) are installed in your Python environment.
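
To make steps 3 to 6 concrete, here is a plain-Python sketch of the kind of transformation `MultiLabelBinarizer` performs: it collects the sorted set of distinct labels and emits one 0/1 column per label. `split_languages`, `binarize_labels`, and the three sample strings are hypothetical and are not part of the script, which relies on scikit-learn's own implementation.

```python
def split_languages(cell):
    """Turn a comma-separated string such as 'EN, FR' into ['EN', 'FR']."""
    return [lang.strip() for lang in cell.split(",") if lang.strip()]

def binarize_labels(rows):
    """Hypothetical stand-in for MultiLabelBinarizer.fit_transform."""
    # One column per distinct label, in sorted order
    classes = sorted({label for row in rows for label in row})
    # 1 where the row contains the label, 0 otherwise
    matrix = [[1 if c in row else 0 for c in classes] for row in rows]
    return classes, matrix

rows = [split_languages(s) for s in ["EN, FR", "AR,EN,FR,", "ES"]]
classes, matrix = binarize_labels(rows)

print(classes)  # ['AR', 'EN', 'ES', 'FR']
for row in matrix:
    print(row)
# [0, 1, 0, 1]
# [1, 1, 0, 1]
# [0, 0, 1, 0]
```

The stray trailing comma in the second sample is dropped by the `if lang.strip()` filter, which is why the splitting step strips and filters each entry before binarization.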


In [2]:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Step 1: Load the dataset
df = pd.read_excel("./normalized_text.xlsx")  # Update the path to where you've stored the file

# Step 2: Split the comma-separated 'Langauges' column (spelled this way in the dataset)
# into a list of language codes per row, dropping empty entries and stray whitespace
df['Langauges'] = df['Langauges'].apply(lambda x: [lang.strip() for lang in x.split(',') if lang.strip()])

# Initialize the MultiLabelBinarizer
mlb = MultiLabelBinarizer()

# Convert the language lists into a binary indicator matrix (one column per language)
binary_matrix = mlb.fit_transform(df['Langauges'])

# Wrap the binary matrix in a DataFrame, using the learned language classes as column names
binary_matrix_df_corrected = pd.DataFrame(binary_matrix, columns=mlb.classes_)

# Inspect the corrected DataFrame
binary_matrix_df_corrected.head()


# Merge the binary matrix with the original DataFrame
df_combined = pd.concat([df, binary_matrix_df_corrected], axis=1)

# Save the combined DataFrame to a new Excel file
output_file_path = "combined_normalized_text.xlsx"  # Specify your desired output file path
df_combined.to_excel(output_file_path, index=False)

print(f"Combined DataFrame saved to {output_file_path}")


Combined DataFrame saved to combined_normalized_text.xlsx


In [3]:
print(df['Langauges'].apply(lambda x: ','.join(sorted(x))).unique())


['AR,EN,FR' 'DE,ES,KO' 'EN,HI,RU' 'EN,ES,PT' 'EN,FR,IT' 'EN,FR' 'AR,FR'
 'EN,KO' 'ES,IT' 'EN,SW' 'DE,EN' 'AR,EN' 'EN,ES' 'EN,IS' 'ES,PT']
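
The unique combinations printed above also indicate how many indicator columns to expect from the binarization step. A small sketch using only the combinations from that output (the variable names are illustrative):

```python
from collections import Counter

# Unique language combinations copied from the output above
combos = [
    "AR,EN,FR", "DE,ES,KO", "EN,HI,RU", "EN,ES,PT", "EN,FR,IT",
    "EN,FR", "AR,FR", "EN,KO", "ES,IT", "EN,SW",
    "DE,EN", "AR,EN", "EN,ES", "EN,IS", "ES,PT",
]

# Count in how many distinct combinations each language appears
counts = Counter(lang for combo in combos for lang in combo.split(","))

print(len(counts))             # 12
print(counts.most_common(3))   # [('EN', 11), ('ES', 5), ('FR', 4)]
```

These counts are over distinct combinations, not over rows of the dataset, so they show which languages occur but not how often each one appears in the training data.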


In [4]:
import pandas as pd
from collections import Counter
from sklearn.model_selection import train_test_split
import unicodedata

# Install the required libraries by running this in your Python environment:
# pip install pandas openpyxl scikit-learn

# Read the Excel file
df = pd.read_excel('/home/ramzidaher/OneDrive/Desktop/[02] University/Third Year/Individual Project/DataSets/Lyrics Training Data GPT Generated.xlsx', engine='openpyxl')

# The lyrics are stored in the 'Lyrics' column; cast to str so missing values become text
texts = df['Lyrics'].astype(str).tolist()

# Normalize text
def normalize_text(texts):
    normalized_texts = []
    for text in texts:
        text = str(text).lower()  # Convert to lowercase and ensure text is string
        text = ' '.join(text.split())  # Normalize whitespace
        text = ''.join(char for char in text if unicodedata.category(char)[0] in ('L', 'N', 'P', 'Z'))  # Keep letters, numbers, punctuation, and separators
        normalized_texts.append(text)
    return normalized_texts

# Normalize every lyric once (normalize_text casts each entry to str itself)
normalized_texts = normalize_text(texts)

# Add the normalized text to the dataframe
df['NormalizedText'] = normalized_texts

# Save the dataframe with the normalized text to a new Excel file
df.to_excel('normalized_text.xlsx', engine='openpyxl', index=False)

In [5]:
import sys
print(sys.executable)
# Use this executable path to install TensorFlow
!{sys.executable} -m pip install --upgrade tensorflow


/bin/python3


Defaulting to user installation because normal site-packages is not writeable


In [6]:
!{sys.executable} -m pip install --upgrade --force-reinstall tensorflow

Defaulting to user installation because normal site-packages is not writeable
Collecting tensorflow
  Using cached tensorflow-2.15.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (475.2 MB)
Collecting wrapt<1.15,>=1.11.0
  Using cached wrapt-1.14.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (77 kB)
Collecting libclang>=13.0.0
  Using cached libclang-16.0.6-py2.py3-none-manylinux2010_x86_64.whl (22.9 MB)
Collecting keras<2.16,>=2.15.0
  Using cached keras-2.15.0-py3-none-any.whl (1.7 MB)
Collecting google-pasta>=0.1.1
  Using cached google_pasta-0.2.0-py3-none-any.whl (57 kB)
Collecting setuptools
  Using cached setuptools-69.1.0-py3-none-any.whl (819 kB)
Collecting grpcio<2.0,>=1.24.3
  Using cached grpcio-1.60.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.4 MB)
Collecting gast!=0.5.0,!=0.5.1,!=0.5.2,>=0.2.1
  Using cached gast-0.5.4-py3-none-any.whl (19 kB)
Collecting termcolor>=1.1.0
  Using c