# Lowercase

An important first step in working with text data is converting it to lowercase. Why do we do this? It keeps our data and our output consistent. Whether we're doing exploratory analysis or machine learning, we want the same word to be understood and counted as the same word, but a model might treat a capitalised word differently from the same word in lowercase. Lowercasing ensures conformity.

It also makes it easier to continue with additional cleaning of the data, as we don't have to account for different cases.

However, do remember that lowercasing can change the meaning of some text, e.g. "US" in uppercase is understood as a country, as opposed to the pronoun "us".

Let's take a look at how easy it is to convert our data to lowercase using Python's built-in lower() string method.

In [3]:
sentence = "My Name is Soubhik"
lowercase_sentence = sentence.lower()
print(lowercase_sentence)

my name is soubhik


In [4]:
sentence_list = ["My Name is Soubhik", "I live in India"]
lowercase_sentences = [x.lower() for x in sentence_list]
print(lowercase_sentences)

['my name is soubhik', 'i live in india']
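
As noted above, lowercasing can erase distinctions such as "US" versus "us". The following is a minimal sketch of one possible workaround: lowercase every token except those in a protected set. The helper name `lowercase_except` and the contents of `PROTECTED_TOKENS` are assumptions made for illustration and are not part of the original lesson.

```python
# Hypothetical set of tokens whose capitalisation carries meaning (illustrative only).
PROTECTED_TOKENS = {"US", "UK", "EU"}

def lowercase_except(text, protected=PROTECTED_TOKENS):
    """Lowercase each whitespace-separated token unless it is in `protected`."""
    return " ".join(
        token if token in protected else token.lower()
        for token in text.split()
    )

example = "The US team met with us in London"
print(lowercase_except(example))
# the US team met with us in london
```

Because the sketch splits on whitespace only, a token with attached punctuation such as "US," would not be recognised as protected, so punctuation would need handling first.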


# Stopwords

In this lesson we'll be using the nltk package to remove stop words from text.

Stop words are common words in the language which don't carry much meaning e.g. "and", "of", "a", "to".

Removing these words strips a lot of complexity from the data. They don't add much meaning to the text, so by removing them we are left with a smaller, cleaner dataset. Smaller, cleaner datasets can lead to increased accuracy in machine learning and will also speed up processing times.

In [5]:
# Removing stop words, e.g. and, to, in, the, of, a, an, is, this, using the Natural Language Toolkit (nltk) library

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

english_stopwords = stopwords.words('english')
print(english_stopwords)



['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/soubhik/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [6]:
sentence = "This is a sample sentence, not showing off the stop words filtration."
sentence_no_stopwords = ' '.join([word for word in sentence.split() if word.lower() not in english_stopwords])
print(sentence_no_stopwords)

sample sentence, showing stop words filtration.


In [7]:
english_stopwords.remove('did')
english_stopwords.remove('not')

In [8]:
english_stopwords.append('go')
sentence_no_stopwords_custom = ' '.join([word for word in sentence.split() if word.lower() not in english_stopwords])
print(sentence_no_stopwords_custom)

sample sentence, not showing stop words filtration.
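
The cells above modify `english_stopwords` in place, so re-running the `remove` cell would raise a `ValueError` once 'did' is already gone. A minimal sketch of an alternative is shown here: build a fresh set each time, which also gives faster membership checks than a list. The helper names and the short `base_stopwords` list are assumptions for illustration. In the lesson, the base list would be `stopwords.words('english')`.

```python
# Small illustrative stop word list so the sketch runs without downloading NLTK data.
base_stopwords = ["this", "is", "a", "not", "off", "the", "did"]

def build_stopword_set(base, keep=(), extra=()):
    """Return a new set: `base` minus the words in `keep`, plus the words in `extra`."""
    return (set(base) - set(keep)) | set(extra)

def remove_stopwords(text, stopword_set):
    """Drop whitespace-separated tokens whose lowercase form is a stop word."""
    return " ".join(w for w in text.split() if w.lower() not in stopword_set)

custom_stopwords = build_stopword_set(base_stopwords, keep=["did", "not"], extra=["go"])
sentence = "This is a sample sentence, not showing off the stop words filtration."
print(remove_stopwords(sentence, custom_stopwords))
# sample sentence, not showing stop words filtration.
```

Since `build_stopword_set` never mutates its input, it can be called repeatedly with different `keep` and `extra` words without side effects.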


# Regular Expressions

Regular expressions, or "regex" for short, are a special syntax for searching for strings that match a specified pattern. They're a great tool for filtering and sorting through text when you want to match patterns rather than a hard-coded string or strings.

There are loads of options for the syntax so it's best to just jump in and get started with some examples.

In [9]:
import re
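
To make the difference between a hard-coded string and a pattern concrete, here is a small sketch. The sample text is made up for illustration.

```python
import re

# Hypothetical sample text, not taken from the lesson's data.
text = "Order 1042 shipped on 2023-05-17, order 77 is pending."

# A hard-coded substring check only finds the exact text we type.
print("1042" in text)            # True

# A pattern finds every run of digits without naming them in advance.
print(re.findall(r"\d+", text))  # ['1042', '2023', '05', '17', '77']
```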

# Raw Strings

Python recognises certain character sequences as having a special meaning; for example, \n is used to indicate a new line. However, sometimes these sequences appear in our strings and we want to tell Python that a \n in our text is a literal \n, rather than a new line.

We can put the 'r' character before a string to indicate to Python that it is what is known as a "raw string".

In [10]:
my_folder = r"C:\my_data\notes"
print(my_folder)

C:\my_data\notes
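
Raw strings matter for regular expressions in particular, because regex syntax uses backslashes too. The sketch below contrasts normal and raw strings; the sample text is an assumption for illustration.

```python
import re

# In a normal string, "\n" is one character (a newline);
# in a raw string, r"\n" is two characters (a backslash and the letter n).
print(len("\n"), len(r"\n"))   # 1 2

# The same applies to "\b": a backspace character in a normal string,
# but a regex word boundary when written as a raw string.
text = "the cat sat"
print(re.search("\bcat\b", text))    # None
print(re.search(r"\bcat\b", text))   # <re.Match object; span=(4, 7), match='cat'>
```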


In [11]:
result_search = re.search("pattern", r"string to contain the pattern")
print(result_search)

<re.Match object; span=(22, 29), match='pattern'>


In [12]:
result_search = re.search("pattern", r"sample string without the special word")
print(result_search)

None
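
As the two cells above show, `re.search` returns a match object when the pattern is found and `None` otherwise, so it is worth checking the result before using it. A minimal sketch, where `describe_match` is a hypothetical helper name:

```python
import re

def describe_match(pattern, text):
    """Return a short description of the first match, or a message if there is none."""
    match = re.search(pattern, text)
    if match is None:
        return "no match"
    # group() is the matched text; start() and end() give its position in `text`.
    return f"found {match.group()!r} at {match.start()}-{match.end()}"

print(describe_match("pattern", "string to contain the pattern"))
# found 'pattern' at 22-29
print(describe_match("pattern", "sample string without the special word"))
# no match
```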


In [13]:
string = r"sara was able to help me find the items I needed quickly"
new_string = re.sub(r"sara", "Soubhik", string)
print(new_string)

Soubhik was able to help me find the items I needed quickly
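
One thing to watch with `re.sub` is that a plain pattern also matches inside longer words. The sketch below, using a made-up sentence, shows how the word boundary `\b` can restrict the replacement to whole words.

```python
import re

# Hypothetical review text for illustration.
review = "sara and sarah both helped me"

# Without word boundaries, the pattern also rewrites the start of "sarah".
print(re.sub(r"sara", "Soubhik", review))
# Soubhik and Soubhikh both helped me

# \b restricts the match to the whole word "sara".
print(re.sub(r"\bsara\b", "Soubhik", review))
# Soubhik and sarah both helped me
```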


In [14]:
customer_review = [
    " sam was a great help to me in the store",
    "the cashier was very rude to him, I think her name was sara",
    "I found everything I needed, thanks to the assistance from sara",
    "amazing work from sadeen!",
    "not happy with the service, will not come back!",
    "sarah? and sam were both very helpful",
    "i had to wait too long for assistance want",
]

sarahs_reviews = []

pattern_to_find = r"sarah?"

for string in customer_review:
    if re.search(pattern_to_find, string):
        sarahs_reviews.append(string)

print(sarahs_reviews)


['the cashier was very rude to him, I think her name was sara', 'I found everything I needed, thanks to the assistance from sara', 'sarah? and sam were both very helpful']
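
In the pattern above, the `?` makes the preceding "h" optional, which is why both "sara" and "sarah" are found. It does not match the literal question mark in the review text. A small sketch of the difference, using an illustrative list of names:

```python
import re

names = ["sara", "sarah", "sarah?", "sam"]

# "h?" means zero or one "h", so both "sara" and "sarah" match.
optional_h = [n for n in names if re.search(r"sarah?", n)]
print(optional_h)   # ['sara', 'sarah', 'sarah?']

# To match a literal question mark, escape it (or use re.escape).
literal_q = [n for n in names if re.search(r"sarah\?", n)]
print(literal_q)    # ['sarah?']
print(re.escape("sarah?"))  # sarah\?
```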


In [15]:
a_review = []
pattern_to_find = r"^a"

for string in customer_review:
    if re.search(pattern_to_find, string):
        a_review.append(string)
print(a_review)

['amazing work from sadeen!']


In [16]:
y_review = []
pattern_to_find = r"y$"

for string in customer_review:
    if re.search(pattern_to_find, string):
        y_review.append(string)
print(y_review)

[]
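
The two cells above use the anchors `^` (start of the string) and `$` (end of the string). Leading whitespace or trailing punctuation can stop an anchored pattern from matching, as this sketch with a few made-up reviews suggests:

```python
import re

reviews = [" sam was a great help", "amazing work from sadeen!", "very happy"]

# ^ anchors to the start of the string, so a leading space blocks the match.
print([r for r in reviews if re.search(r"^s", r)])          # []
print([r for r in reviews if re.search(r"^s", r.strip())])  # [' sam was a great help']

# $ anchors to the end; [!.?]* allows optional trailing punctuation.
print([r for r in reviews if re.search(r"y$", r)])          # ['very happy']
print([r for r in reviews if re.search(r"n[!.?]*$", r)])    # ['amazing work from sadeen!']
```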


In [17]:
needwant_reviews = []
pattern_to_find = r"\b(need|want)(s|ed)?\b"

for string in customer_review:
    if re.search(pattern_to_find, string):
        needwant_reviews.append(string)
print(needwant_reviews)

['I found everything I needed, thanks to the assistance from sara', 'i had to wait too long for assistance want']
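
The pattern above combines alternation (`need|want`), an optional suffix (`(s|ed)?`) and word boundaries (`\b`). If we want to pull out the matched words rather than just test for them, `re.findall` behaves differently depending on whether the groups capture. A short sketch on a made-up sentence:

```python
import re

text = "she needs help, he wanted answers, and I need both"

# With capturing groups, findall returns a tuple of groups per match.
print(re.findall(r"\b(need|want)(s|ed)?\b", text))
# [('need', 's'), ('want', 'ed'), ('need', '')]

# Non-capturing groups (?:...) make findall return the whole matched word.
print(re.findall(r"\b(?:need|want)(?:s|ed)?\b", text))
# ['needs', 'wanted', 'need']
```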


In [18]:
no_punctuation_reviews = []
pattern_to_find = r"[^\w\s]"
for string in customer_review:
    no_punct_string = re.sub(pattern_to_find, "", string)
    no_punctuation_reviews.append(no_punct_string)
print(no_punctuation_reviews)

[' sam was a great help to me in the store', 'the cashier was very rude to him I think her name was sara', 'I found everything I needed thanks to the assistance from sara', 'amazing work from sadeen', 'not happy with the service will not come back', 'sarah and sam were both very helpful', 'i had to wait too long for assistance want']
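
The steps covered so far (lowercasing, punctuation removal and stop word removal) can be chained into one cleaning function. The following is only a sketch under stated assumptions: `clean_text` is a hypothetical helper, and `STOPWORDS` is a short hand-written set standing in for the NLTK list so that the example runs without any downloads.

```python
import re

# Illustrative stop word set; "not" is deliberately left out so negation survives.
STOPWORDS = {"was", "a", "to", "me", "in", "the", "i", "from", "with", "will"}

def clean_text(text, stopwords=STOPWORDS):
    """Lowercase, strip punctuation, then drop stop words."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)   # same pattern as the cell above
    return " ".join(w for w in text.split() if w not in stopwords)

print(clean_text(" sam was a great help to me in the store"))
# sam great help store
print(clean_text("not happy with the service, will not come back!"))
# not happy service not come back
```

The order of the steps is a design choice. Stripping punctuation first turns a word like "don't" into "dont", which may no longer match an entry in the stop word list, so a different order may suit other data better.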
