# CleanText

CleanText is an open-source Python package built specifically for cleaning raw text data (as the name suggests).

It is a simple, easy-to-use package that needs very little code yet offers a ton of features (we all want that, right?). It exposes two main methods:

> * clean: cleans raw text and returns the cleaned text as a string.
> * clean_words: cleans raw text in the same way, but returns a list of clean words (even better).

The beautiful thing about the CleanText package is not the number of operations it supports but how easily you can use them. The sections below walk through those operations one at a time, each with a short piece of code for better understanding.

To read more about it, please refer to [this](https://analyticsindiamag.com/guide-to-cleantext-a-python-package-to-clean-raw-text-data/) article.

# Code Implementation of CleanText

## Installation

The CleanText package requires Python 3 and NLTK.

To install it with pip, use the following commands.

In [None]:
!python -m pip install pip --upgrade --user -q --no-warn-script-location
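# The keyword arguments used below (fix_unicode, to_ascii, no_urls, ...) match the
# `clean-text` distribution, which is imported as `cleantext`.
!python -m pip install numpy pandas seaborn matplotlib scipy statsmodels scikit-learn nltk gensim clean-text --user -q --no-warn-script-location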


In [None]:
from cleantext import clean

# Unicode

Text often contains letters from outside the basic English alphabet. Unicode assigns a distinct code point to each such character, and encodings such as UTF-8 and UTF-32 define how those code points are stored as bytes.

In [None]:
s1 = 'Zürich'

In [None]:
clean(s1, fix_unicode=True)

Notice that the ‘ü’ is a non-ASCII character. We have to convert it into a plain ASCII character; otherwise it will not be recognised as an English letter and may be discarded by later processing steps.

The same applies to many words that English has borrowed from other languages.
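To see why a word like ‘Zürich’ needs care, it helps to look at the code points involved. The snippet below is a minimal sketch that uses only Python's standard `unicodedata` module, not CleanText itself; it shows that the same visible word can be stored in two different ways.

```python
import unicodedata

composed = "Z\u00fcrich"                              # 'ü' stored as one code point (U+00FC)
decomposed = unicodedata.normalize("NFD", composed)   # 'u' followed by a combining diaeresis (U+0308)

print(len(composed), len(decomposed))   # the decomposed form is one code point longer
print(composed == decomposed)           # the two strings compare as different

# Normalising both strings to the same form (NFC here) makes them comparable again.
recomposed = unicodedata.normalize("NFC", decomposed)
print(recomposed == composed)
```

Both strings look identical on screen, which is why this kind of mismatch is easy to miss in raw data.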

# Closest ASCII representation


ASCII, short for American Standard Code for Information Interchange, is a character encoding standard like Unicode, but limited to 128 characters. Such standards are used to represent text in computers and telecommunications equipment, giving devices a common character set so that they can communicate with each other.

In [None]:
s2 = "ko\u017eu\u0161\u010dek"

In [None]:
clean(s2, to_ascii=True)

As you can see, the plain ASCII letters are left untouched, and the escaped Unicode characters in our text have been converted to their closest ASCII equivalents. Such characters turn up frequently in data for NLP tasks, so this is a useful operation that is easy to perform.
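For intuition, here is a rough standard-library approximation of "closest ASCII". It is only a sketch and not necessarily what CleanText does internally; the helper name `closest_ascii` is made up for this example. It decomposes each accented letter into a base letter plus combining marks and then drops everything that is not ASCII.

```python
import unicodedata

def closest_ascii(text):
    """Approximate ASCII transliteration by dropping combining accents."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Anything that still is not ASCII after decomposition is silently dropped.
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(closest_ascii("ko\u017eu\u0161\u010dek"))   # kozuscek
```

Characters that have no decomposition (for example ‘ß’ or ‘ø’) are simply lost with this approach, which is one reason to prefer a dedicated cleaning package.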


# Lower Case

Uppercase and lowercase letters are treated as different characters, so we usually convert everything to a single case (preferably lowercase). Case rarely matters when extracting meaning from text, so this conversion is generally safe to perform.

In [None]:
s3 = "My Name is YourFullName"

In [None]:
clean(s3, lower=True)
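A quick way to see the effect of case is to count words before and after lowercasing. This is a small illustration using only the standard library; the sample sentence is made up for the example.

```python
from collections import Counter

words = "The cat saw the Cat and THE dog".split()

raw_counts = Counter(words)                         # 'The', 'the' and 'THE' are counted separately
lowered_counts = Counter(w.lower() for w in words)  # after lowercasing they merge into one entry

print(raw_counts["the"], lowered_counts["the"])     # 1 3
```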

# Replace URLs

We often need to replace URLs with some other string. Usually, this requires complex regular expressions (I hate them); CleanText's solution is shown below.

In [None]:
s4 = "https://www.Google.com has surpassed https://www.Bing.com in search volume"

In [None]:
clean(s4, no_urls=True, replace_with_url="URL")
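For comparison, a hand-rolled version might look like the sketch below. The pattern is deliberately simplistic and is an assumption made for this example, not the expression CleanText uses; the helper name `replace_urls` is also hypothetical.

```python
import re

# Deliberately simple: "http://" or "https://" followed by any run of non-space characters.
URL_PATTERN = re.compile(r"https?://\S+")

def replace_urls(text, replacement="URL"):
    return URL_PATTERN.sub(replacement, text)

s4 = "https://www.Google.com has surpassed https://www.Bing.com in search volume"
print(replace_urls(s4))   # URL has surpassed URL in search volume
```

A pattern this naive also swallows trailing punctuation and misses URLs without a scheme (such as `www.example.com`), which is exactly the kind of detail that makes the regex route painful.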

# Replace Currency

We also encounter currency symbols in our text; we can either remove them completely (nope, won’t help) or replace them with text (which is much better). Below is an example using the rupee, the currency of India.

In [None]:
s5 = "I want ₹ 40"

In [None]:
clean(s5, no_currency_symbols=True)

In [None]:
clean(s5,
      no_currency_symbols=True,
      replace_with_currency_symbol="Rupees")
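The same idea can be sketched without the package by using Unicode character categories: every currency sign belongs to category `Sc`. This is only an illustration of the concept; the helper `replace_currency` and its default placeholder are assumptions made for this example.

```python
import unicodedata

def replace_currency(text, replacement="<CUR>"):
    """Replace every character in Unicode category 'Sc' (currency symbol)."""
    return "".join(
        replacement if unicodedata.category(ch) == "Sc" else ch
        for ch in text
    )

print(replace_currency("I want \u20b9 40", "Rupees"))   # I want Rupees 40
```

Because the check is category-based, the same function handles other currency signs such as the dollar or euro sign without any extra code.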

# Replace Punctuation

This is arguably one of the most useful operations when handling language-related tasks.

Punctuation marks rarely add value to the tasks we perform on the text dump we have.

In [None]:
s6 = "40,000 is greater than 30,000"

In [None]:
clean(s6, no_punct=True)

In [None]:
clean(s6, no_punct=True, replace_with_punct="7")
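As a point of reference, the sketch below removes or replaces punctuation with the standard library only. It covers just the ASCII punctuation listed in `string.punctuation`, and the helper name `strip_punct` is made up for this example.

```python
import string

def strip_punct(text, replacement=""):
    # Map every ASCII punctuation character to the replacement string.
    table = str.maketrans({ch: replacement for ch in string.punctuation})
    return text.translate(table)

print(strip_punct("40,000 is greater than 30,000"))        # 40000 is greater than 30000
print(strip_punct("40,000 is greater than 30,000", " "))   # 40 000 is greater than 30 000
```

Note how removing the comma changes how the numbers read, so it is worth deciding whether punctuation should be dropped or replaced with a space.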

# Remove Numbers

Removing digits is another important operation on text data, since digits often add no semantic or syntactic value.

In [None]:
s7 = 'abc123def456ghi789zero0'

In [None]:
clean(s7, no_digits=True)

In [None]:
clean(s7, no_digits=True, replace_with_digit="")
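The difference between dropping digits, masking each digit, and treating a run of digits as one number can be seen with plain regular expressions. This is a standard-library sketch of the idea, not the package's implementation.

```python
import re

s7 = "abc123def456ghi789zero0"

dropped = re.sub(r"\d", "", s7)            # remove every digit
masked = re.sub(r"\d", "0", s7)            # replace each digit with a placeholder character
grouped = re.sub(r"\d+", "<NUMBER>", s7)   # treat each run of digits as a single number

print(dropped)   # abcdefghizero
print(masked)    # abc000def000ghi000zero0
print(grouped)   # abc<NUMBER>def<NUMBER>ghi<NUMBER>zero<NUMBER>
```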

# Combining it all

In [None]:
s8 = """
Zürich has a famous website https://www.zuerich.com/ 
WHICH ACCEPTS 40,000 € and adding a random string, :
abc123def456ghi789zero0 for this demo. ' 
     """

In [None]:
clean(s8, 
      fix_unicode=True, 
      to_ascii=True, 
      lower=True, 
      no_urls=True, 
      no_numbers=True, 
      no_digits=True, 
      no_currency_symbols=True, 
      no_punct=True, 
      replace_with_punct="", 
      replace_with_url="<URL>", 
      replace_with_number="<NUMBER>", 
      replace_with_digit="", 
      replace_with_currency_symbol="<CUR>")
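When several operations are combined, the order in which they run matters. The sketch below chains a few standard-library steps to make that visible; it is an illustration under simplifying assumptions, not a description of how CleanText orders its operations, and the names `STEPS` and `run_pipeline` are hypothetical.

```python
import re
import unicodedata
from functools import reduce

def to_ascii(text):
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")

STEPS = [
    to_ascii,
    str.lower,                                       # lowercase before inserting placeholders
    lambda t: re.sub(r"https?://\S+", "<URL>", t),   # replace URLs before digits are touched
    lambda t: re.sub(r"\d", "", t),                  # drop digits
    lambda t: " ".join(t.split()),                   # collapse runs of whitespace
]

def run_pipeline(text, steps=STEPS):
    # Feed the output of each step into the next one.
    return reduce(lambda acc, step: step(acc), steps, text)

result = run_pipeline("Z\u00fcrich has a site https://www.zuerich.com/ and code abc123")
print(result)   # zurich has a site <URL> and code abc
```

Running the digit step before the URL step would corrupt any URL that contains numbers, and lowercasing after the URL step would turn the placeholder into `<url>`.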