# Lab I. - Text Cleaning

## [Huggingface](https://huggingface.co/docs/datasets/index)

- IMDB dataset: hf://datasets/scikit-learn/imdb/IMDB Dataset.csv

In [1]:
import pandas as pd

df = pd.read_csv("hf://datasets/scikit-learn/imdb/IMDB Dataset.csv")
df.head()



Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


## Lowercasing

Convert all characters to lowercase so that variants differing only in case (e.g. "Apple" and "apple") are treated as the same token.

In [2]:
dummy = ["Apple", "APPLE", "appLe"]

for word in dummy:
    print(word.lower())

apple
apple
apple


In [3]:
df["review"] = df["review"].str.lower()
df.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. <br /><br />the...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive
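
For text that contains non-English characters, `str.casefold()` is a more aggressive alternative to `str.lower()`. The following is a minimal sketch; the helper name `normalize_case` and the sample words are illustrative and do not come from the dataset.

```python
def normalize_case(text):
    """Return a case-insensitive form of the text using casefold()."""
    return text.casefold()

# Compare lower() and casefold() on words where they may differ.
for word in ["Apple", "STRASSE", "Straße"]:
    print(word, "->", word.lower(), "|", normalize_case(word))

# casefold() maps the German sharp s to "ss", so both spellings compare equal.
print(normalize_case("Straße") == normalize_case("STRASSE"))
```

For the English IMDB reviews the two methods should give the same result in almost all cases, so `lower()` is a reasonable choice here.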


## Removing Punctuation

Remove punctuation (e.g. periods, commas, question marks), since it is often not relevant for text analysis.

In [4]:
# Example: remove punctuation with Python's built-in `string.punctuation`.
# The NLTK tokenizer imported here is used in the later steps.

import nltk
import string

nltk.download('punkt')
from nltk.tokenize import word_tokenize

def remove_punctuation(text):
  """Removes punctuation from a string."""
  translator = str.maketrans('', '', string.punctuation)
  return text.translate(translator)

# Example usage
text = "This is an example sentence! With some punctuation, and other things."
cleaned_text = remove_punctuation(text)
print(cleaned_text)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


This is an example sentence With some punctuation and other things


In [5]:
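# Apply it to the review column of the DataFrame.
from tqdm import tqdm

# Use `.at` instead of chained indexing (df["review"][i] = ...), which can
# write to a temporary copy and raises a SettingWithCopyWarning.
for i in tqdm(range(len(df))):
    df.at[i, "review"] = remove_punctuation(df.at[i, "review"])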

df.head()

100%|██████████| 50000/50000 [00:18<00:00, 2702.02it/s]


Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production br br the filmin...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically theres a family where a little boy j...,negative
4,petter matteis love in the time of money is a ...,positive
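
The output above still contains `br br`, which is what remains of the `<br />` tags once the angle brackets and slashes are removed. One way to avoid this is to strip HTML tags before removing punctuation. The sketch below assumes a simple regular expression is good enough for these reviews; the names `TAG_RE`, `PUNCT_TABLE` and `strip_tags_then_punctuation` are made up for this example.

```python
import re
import string

# Matches anything that looks like an HTML tag, e.g. "<br />".
TAG_RE = re.compile(r"<[^>]+>")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_tags_then_punctuation(text):
    """Replace HTML tags with spaces, drop punctuation, collapse whitespace."""
    no_tags = TAG_RE.sub(" ", text)
    no_punct = no_tags.translate(PUNCT_TABLE)
    return " ".join(no_punct.split())

sample = "a wonderful little production. <br /><br />the filming"
print(strip_tags_then_punctuation(sample))
# a wonderful little production the filming
```

For messier HTML a real parser would be more robust than a regular expression.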


## Removing Numbers

Remove digits, since in many cases they carry little information, although in some domains (e.g. financial analysis) they can be important.

In [8]:
# Example: remove the digits from a sample sentence, then from the DataFrame (df).

def remove_numbers(text):
  """Removes numbers from a string."""
  result = ''.join([i for i in text if not i.isdigit()])
  return result

# Example usage
text = "This is an example sentence with 123 numbers."
cleaned_text = remove_numbers(text)
print(cleaned_text)

df["review"] = df["review"].apply(remove_numbers)
df.head()

This is an example sentence with  numbers.


Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production br br the filmin...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically theres a family where a little boy j...,negative
4,petter matteis love in the time of money is a ...,positive


In [9]:
# Register `progress_apply` on pandas objects; it behaves like `apply` but shows a tqdm progress bar.
tqdm.pandas()

# Same cleaning step as above, this time with a progress bar (running it twice is harmless).
df["review"] = df["review"].progress_apply(remove_numbers)
df.head()

100%|██████████| 50000/50000 [00:07<00:00, 6890.45it/s]


Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production br br the filmin...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically theres a family where a little boy j...,negative
4,petter matteis love in the time of money is a ...,positive
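
Removing digits character by character leaves a double space behind, as in `with  numbers.` above. A regular expression can remove whole digit runs and tidy the whitespace in one pass. This is only a sketch of an alternative; `remove_numbers_regex` is a name chosen for this example.

```python
import re

DIGITS_RE = re.compile(r"\d+")

def remove_numbers_regex(text):
    """Replace each run of digits with a space, then collapse whitespace."""
    without_digits = DIGITS_RE.sub(" ", text)
    return " ".join(without_digits.split())

print(remove_numbers_regex("This is an example sentence with 123 numbers."))
# This is an example sentence with numbers.
```

Replacing digits with a space rather than an empty string also keeps words apart in inputs like `part2of3`.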


## Removing Stopwords

Remove words that occur frequently but carry little information (e.g. "a", "and", "but").

In [11]:
# Example: remove English stopwords from a sample sentence, then from the DataFrame.

from nltk.corpus import stopwords
nltk.download('stopwords')

# Build the stopword set once instead of on every call.
STOP_WORDS = set(stopwords.words('english'))

def remove_stopwords(text):
  """Removes stopwords from a string."""
  word_tokens = word_tokenize(text)
  filtered_sentence = [w for w in word_tokens if w.lower() not in STOP_WORDS]
  return " ".join(filtered_sentence)

# Example usage
text = "This is an example sentence with some stopwords."
cleaned_text = remove_stopwords(text)
print(cleaned_text)

example sentence stopwords .


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [12]:
tqdm.pandas()

df["review"] = df["review"].progress_apply(lambda x: remove_stopwords(x))
df.head()

100%|██████████| 50000/50000 [01:00<00:00, 827.85it/s]


Unnamed: 0,review,sentiment
0,one reviewers mentioned watching oz episode yo...,positive
1,wonderful little production br br filming tech...,positive
2,thought wonderful way spend time hot summer we...,positive
3,basically theres family little boy jake thinks...,negative
4,petter matteis love time money visually stunni...,positive
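
NLTK's English stopword list includes negations such as "not", so removing all stopwords can turn "not good" into "good", which matters for a sentiment dataset. The sketch below shows one possible workaround: keep a small set of negations. The function name, the `keep` default and the toy stopword set are assumptions for illustration; in the notebook the set from `stopwords.words('english')` would be passed instead.

```python
def remove_stopwords_keep_negations(text, stop_words,
                                    keep=frozenset({"not", "no", "nor"})):
    """Remove stopwords from whitespace-separated text, except words in `keep`."""
    effective = set(stop_words) - set(keep)
    return " ".join(w for w in text.split() if w.lower() not in effective)

# Toy stopword set standing in for the NLTK list.
toy_stop_words = {"this", "is", "a", "not", "the"}

print(remove_stopwords_keep_negations("this is not a good film", toy_stop_words))
# not good film
```

Whether keeping negations helps depends on the model that consumes the text, so it is worth checking on the actual task.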


## Stemming

Reduce words to their stem by stripping suffixes, so that, for example, "running" and "runs" both become "run". The resulting stem does not have to be a dictionary word.

In [14]:
# Example: stem the words of a sample sentence with the Porter stemmer.

from nltk.stem import PorterStemmer

# Create the stemmer once instead of on every call.
stemmer = PorterStemmer()

def stem_text(text):
  """Stems words in a string."""
  word_tokens = word_tokenize(text)
  stemmed_sentence = [stemmer.stem(w) for w in word_tokens]
  return " ".join(stemmed_sentence)

# Example usage
text = "This is an example sentence with some running and running."
cleaned_text = stem_text(text)
print(cleaned_text)

thi is an exampl sentenc with some run and run .


In [21]:
tqdm.pandas()

df["review"] = df["review"].progress_apply(lambda x: stem_text(x))
df.head()

100%|██████████| 50000/50000 [02:53<00:00, 287.88it/s]


Unnamed: 0,review,sentiment
0,one review mention watch oz episod youll hook ...,positive
1,wonder littl product br br film techniqu unass...,positive
2,thought wonder way spend time hot summer weeke...,positive
3,basic there famili littl boy jake think there ...,negative
4,petter mattei love time money visual stun film...,positive


## Lemmatization

Reduce words to their dictionary form (lemma), taking the part of speech into account: for example, "ran" becomes "run" and "faster" becomes "fast". Note that NLTK's `WordNetLemmatizer` treats every word as a noun unless a `pos` argument is passed.

In [16]:
# Example: lemmatize a sample sentence with NLTK's WordNet lemmatizer.

import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

# Create the lemmatizer once instead of on every call.
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
  """Lemmatizes words in a string."""
  word_tokens = word_tokenize(text)
  lemmatized_sentence = [lemmatizer.lemmatize(w) for w in word_tokens]
  return " ".join(lemmatized_sentence)

# Example usage
text = "This is an example sentence with some faster and quick."
cleaned_text = lemmatize_text(text)
print(cleaned_text)

[nltk_data] Downloading package wordnet to /root/nltk_data...


This is an example sentence with some faster and quick .


In [22]:
tqdm.pandas()

# The reviews were already stemmed above, so the lemmatizer finds few dictionary words left to change.
df["review"] = df["review"].progress_apply(lemmatize_text)
df.head()

100%|██████████| 50000/50000 [00:54<00:00, 919.51it/s] 


Unnamed: 0,review,sentiment
0,one review mention watch oz episod youll hook ...,positive
1,wonder littl product br br film techniqu unass...,positive
2,thought wonder way spend time hot summer weeke...,positive
3,basic there famili littl boy jake think there ...,negative
4,petter mattei love time money visual stun film...,positive


In [20]:
# Compare stemming and lemmatization on a few example words.

from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Example words
words = ["running", "better", "cats", "feet", "wolves"]

# Stemming and lemmatization
for word in words:
  stemmed_word = stemmer.stem(word)
  lemmatized_word = lemmatizer.lemmatize(word)
  print(f"Word: {word}, Stemmed: {stemmed_word}, Lemmatized: {lemmatized_word}")


Word: running, Stemmed: run, Lemmatized: running
Word: better, Stemmed: better, Lemmatized: better
Word: cats, Stemmed: cat, Lemmatized: cat
Word: feet, Stemmed: feet, Lemmatized: foot
Word: wolves, Stemmed: wolv, Lemmatized: wolf
