# Lab II - Text Cleaning

## [Hugging Face](https://huggingface.co/docs/datasets/index)

- IMDB dataset: hf://datasets/scikit-learn/imdb/IMDB Dataset.csv

In [3]:
import pandas as pd

imdb_dataset = pd.read_csv("hf://datasets/scikit-learn/imdb/IMDB Dataset.csv")
imdb_dataset.head()


Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


## Handling Whitespace

Remove or normalize unnecessary spaces, tabs, and line breaks so that the text is uniform.

In [None]:
# prompt: Write Python code that shows how to remove extra whitespace from a simple text.

text = "  This is a text with   extra   spaces.  "

# Remove leading and trailing whitespace
text = text.strip()

# Collapse every run of whitespace (spaces, tabs, newlines) into a single space
text = " ".join(text.split())

print(text)

In [4]:
# prompt: Write Python code that applies the whitespace removal to the review column of the imdb_dataset DataFrame, with a progress bar.

from tqdm.notebook import tqdm

def remove_extra_whitespace(text):
  """Removes leading/trailing whitespace and collapses runs of whitespace into single spaces."""
  # split() with no arguments already discards leading and trailing whitespace
  text = " ".join(text.split())
  return text

# Apply the function to the 'review' column with a progress bar
imdb_dataset['review'] = [remove_extra_whitespace(review) for review in tqdm(imdb_dataset['review'], desc="Cleaning reviews")]


Cleaning reviews:   0%|          | 0/50000 [00:00<?, ?it/s]
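Joining on `split()` flattens every line break, which is not always wanted. The following is a minimal sketch, not part of the original notebook, of a variant that can keep paragraph breaks while still collapsing tabs, repeated spaces, and non-breaking spaces. The name `normalize_whitespace` and its `keep_paragraphs` flag are assumptions made for illustration.

```python
import re

def normalize_whitespace(text, keep_paragraphs=False):
    """Collapse runs of whitespace; optionally keep blank-line paragraph breaks."""
    if keep_paragraphs:
        # A paragraph break is a newline, optional whitespace, and another newline
        paragraphs = re.split(r"\n\s*\n", text)
        cleaned = (" ".join(p.split()) for p in paragraphs)
        # Drop paragraphs that are empty after cleaning
        return "\n\n".join(p for p in cleaned if p)
    # str.split() with no arguments splits on any Unicode whitespace
    return " ".join(text.split())

sample = "  First\tline \n continues.\n\n\n Second\u00a0paragraph.  "
print(normalize_whitespace(sample))
print(normalize_whitespace(sample, keep_paragraphs=True))
```

For single-paragraph reviews such as the IMDB texts, the default mode behaves like `remove_extra_whitespace`.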

## Removing Special Characters

Remove special characters such as @, #, and %, which usually carry little meaning for interpreting the text.

In [5]:
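# prompt: Write code that removes the special characters from the imdb_dataset's review column.

import re

def remove_special_characters(text):
  """Removes every character that is not an ASCII letter, a digit, or whitespace."""
  # Note: this also strips apostrophes and angle brackets, so any HTML tag removal
  # or contraction expansion has to run before this step to have an effect.
  text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
  return text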

# Apply the function to the 'review' column with a progress bar
imdb_dataset['review'] = [remove_special_characters(review) for review in tqdm(imdb_dataset['review'], desc="Removing special characters")]


Removing special characters:   0%|          | 0/50000 [00:00<?, ?it/s]
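The character class `[^a-zA-Z0-9\s]` also deletes accented letters, which matters for non-English text. As a hedged sketch (the function name `remove_special_characters_unicode` is an assumption, not part of the notebook), a Unicode-aware variant can rely on `\w`, which in Python 3 matches letters and digits from any script:

```python
import re

def remove_special_characters_unicode(text):
    """Remove punctuation and symbols while keeping letters and digits of any script."""
    # [^\w\s] matches everything that is neither a word character nor whitespace;
    # the underscore counts as a word character, so it is removed separately.
    return re.sub(r"[^\w\s]|_", "", text)

sample = "Café #1 @home: 100%!"
print(remove_special_characters_unicode(sample))
```

Whether accented letters should survive depends on the later steps: if accents are removed anyway, the ASCII-only pattern is sufficient as long as accent removal runs first.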

## Removing HTML Tags

For text scraped from the web, remove HTML tags, which are not part of the actual content.

In [6]:
# prompt: Write code that removes the HTML tags from the imdb_dataset's review column.

from bs4 import BeautifulSoup

def remove_html_tags(text):
  """Removes HTML tags from the text."""
  soup = BeautifulSoup(text, "html.parser")
  # A space separator keeps the words on either side of "<br />" from merging
  return soup.get_text(separator=" ")

# Apply the function to the 'review' column with a progress bar
imdb_dataset['review'] = [remove_html_tags(review) for review in tqdm(imdb_dataset['review'], desc="Removing HTML tags")]


Removing HTML tags:   0%|          | 0/50000 [00:00<?, ?it/s]
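The order of the cleaning steps matters here: in this notebook the special-character step runs before the tag removal, so the `<` and `>` of each `<br />` are already gone by the time the parser sees the text. The sketch below illustrates the effect using only the standard library. `strip_tags` and `strip_special` are hypothetical stand-ins for `remove_html_tags` and `remove_special_characters`, not the notebook's own functions.

```python
import re
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags(text):
    """Stand-in for remove_html_tags, built on the standard-library parser."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(" ".join(parser.parts).split())

def strip_special(text):
    """Stand-in for remove_special_characters (same pattern as in the notebook)."""
    return re.sub(r"[^a-zA-Z0-9\s]", "", text)

raw = "Great film.<br /><br />Loved it!"
tags_first = strip_special(strip_tags(raw))
specials_first = strip_tags(strip_special(raw))
print(tags_first)      # tags removed cleanly
print(specials_first)  # leftover "br" fragments remain in the text
```

Running the tag removal first avoids the stray `br` tokens that otherwise end up in the reviews.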

## Expanding Contractions

Expand contractions such as "don't" → "do not" to make the meaning of the text more explicit.

In [7]:
# prompt: Write a Python example that demonstrates expanding contractions in a simple text.

import re

def expand_contractions(text):
  """Expands contractions in the text."""
  contractions = {
      "don't": "do not",
      "can't": "cannot",
      "won't": "will not",
      "shouldn't": "should not",
      "I'm": "I am",
      # Add more contractions as needed
  }

  for contraction, expansion in contractions.items():
    # Escape the key and anchor it at word boundaries so only whole words match
    pattern = r"\b" + re.escape(contraction) + r"\b"
    text = re.sub(pattern, expansion, text)
  return text


text = "I don't know what I'm doing, but I can't stop."
expanded_text = expand_contractions(text)
print(f"Original text: {text}")
print(f"Expanded text: {expanded_text}")


Original text: I don't know what I'm doing, but I can't stop.
Expanded text: I do not know what I am doing, but I cannot stop.
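The dictionary lookup above is case-sensitive, so a sentence-initial "Don't" or a typographic apostrophe (’) would slip through. One possible extension, sketched here under the assumption that a small hand-written mapping is acceptable, compiles a single case-insensitive pattern and restores the capital letter. The names `CONTRACTIONS` and `expand_contractions_ci` are illustrative.

```python
import re

# Keys are stored in lowercase; lookups lowercase the matched text first
CONTRACTIONS = {
    "don't": "do not",
    "can't": "cannot",
    "won't": "will not",
    "shouldn't": "should not",
    "i'm": "I am",
}

_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)

def expand_contractions_ci(text):
    """Expand contractions regardless of case, in a single pass over the text."""
    # Normalize the typographic apostrophe so it matches the dictionary keys
    text = text.replace("\u2019", "'")

    def _replace(match):
        original = match.group(0)
        expansion = CONTRACTIONS[original.lower()]
        # Keep the capital letter of a sentence-initial contraction
        if original[0].isupper():
            expansion = expansion[0].upper() + expansion[1:]
        return expansion

    return _PATTERN.sub(_replace, text)

print(expand_contractions_ci("Don't worry, I'm sure you won't fail."))
```

This step only works if the apostrophes are still present, so it has to run before any special-character removal.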


## Removing Accents and Diacritics

Remove or normalize accents and diacritics, such as "á" → "a", to make the text more consistent.

In [8]:
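# prompt: Write sample Python code for removing accents and diacritics. Use a Hungarian text as the example.

import unicodedata

def remove_accents(text):
  """Removes accents and diacritics from the text."""
  # NFKD splits each accented letter into a base letter plus combining marks;
  # encoding to ASCII with errors='ignore' then drops the marks (and any other non-ASCII character)
  text = unicodedata.normalize('NFKD', text).encode('ASCII', 'ignore').decode('ASCII')
  return text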

hungarian_text = "Ez egy példa magyar szövegre, amely ékezetes betűket tartalmaz."
text_without_accents = remove_accents(hungarian_text)

print(f"Original text: {hungarian_text}")
print(f"Text without accents: {text_without_accents}")


Original text: Ez egy példa magyar szövegre, amely ékezetes betűket tartalmaz.
Text without accents: Ez egy pelda magyar szovegre, amely ekezetes betuket tartalmaz.
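Encoding to ASCII discards every character that has no ASCII base letter, for example "ß" or any non-Latin script. If only the diacritics should go, a gentler approach is to filter out the combining marks after decomposition. The following is a minimal sketch. `strip_diacritics` is a hypothetical name.

```python
import unicodedata

def strip_diacritics(text):
    """Remove combining marks but keep characters that have no ASCII equivalent."""
    # NFD separates base letters from their combining marks
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever is left into the canonical composed form
    return unicodedata.normalize("NFC", stripped)

print(strip_diacritics("árvíztűrő tükörfúrógép"))
print(strip_diacritics("Straße"))
```

Note that for languages where the accent distinguishes words, removing diacritics can merge distinct words, so this step is best applied only when the downstream task tolerates that loss.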


## Using Wordlists

Use dedicated wordlists to filter out unwanted words or phrases.

In [9]:
# prompt: Show how a wordlist works in text cleaning, using Python and a Spanish example. The wordlist includes the following words: bueno, maestra, escuela, universidad, correo electronico

import re

def remove_words_from_wordlist(text, wordlist):
  """Removes whole words or phrases from the text that are present in the wordlist."""
  for word in wordlist:
    # Word boundaries keep "bueno" from also matching inside "buenos"
    text = re.sub(r"\b" + re.escape(word) + r"\b", "", text)
  return text

# Example wordlist in Spanish
spanish_wordlist = ["bueno", "maestra", "escuela", "universidad", "correo electronico"]

# Example text in Spanish
spanish_text = "La maestra de la escuela es muy bueno, y tiene un correo electronico para la universidad."

# Remove words from the wordlist
cleaned_spanish_text = remove_words_from_wordlist(spanish_text, spanish_wordlist)

print(f"Original text: {spanish_text}")
print(f"Text without words from wordlist: {cleaned_spanish_text}")

Original text: La maestra de la escuela es muy bueno, y tiene un correo electronico para la universidad.
Text without words from wordlist: La  de la  es muy , y tiene un  para la .
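The output above still contains doubled spaces and a space before each punctuation mark where a word was removed. One way to tidy this up, sketched here with a hypothetical helper `remove_wordlist_terms`, is to remove all terms in a single case-insensitive pass and then repair the spacing:

```python
import re

def remove_wordlist_terms(text, wordlist):
    """Remove whole-word matches of the wordlist terms, then repair the spacing."""
    # Longer terms first, so multi-word phrases win over any shorter term they contain
    terms = sorted(wordlist, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )
    without_terms = pattern.sub("", text)
    # Drop the whitespace left in front of punctuation marks
    without_terms = re.sub(r"\s+([,.;:!?])", r"\1", without_terms)
    # Collapse the remaining runs of whitespace
    return " ".join(without_terms.split())

spanish_wordlist = ["bueno", "maestra", "escuela", "universidad", "correo electronico"]
spanish_text = (
    "La maestra de la escuela es muy bueno, y tiene un correo electronico "
    "para la universidad."
)
print(remove_wordlist_terms(spanish_text, spanish_wordlist))
```

Matching is done on the text as written, so accented spellings such as "electrónico" only match if the wordlist contains them or if accents are removed beforehand.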
