# Data Cleaning

Main Goal: Analyze motivational quotes and generate new ones.
Section Goal: Get motivational quotes, process them, and save them.

We will start with the following steps:

1. Getting the data
2. Cleaning the data
3. Organizing the data

The input: 300 Quotes from https://www.oberlo.com/blog/motivational-quotes
The output: a corpus (a collection of text) and a document-term matrix (stored here as a count of each word across the corpus).
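
To make the second output concrete, here is a minimal sketch of a document-term matrix built with the standard library only. The names `docs`, `counts`, `vocab`, and `dtm` are illustrative assumptions and not part of this notebook's pipeline. Each row is one document, each column one vocabulary word, and each cell counts how often that word occurs in that document.

```python
from collections import Counter

# Two already-cleaned quotes used as a tiny sample corpus.
docs = ["dont limit yourself", "the secret of getting ahead is getting started"]

# One Counter per document, then a shared, sorted vocabulary.
counts = [Counter(doc.split()) for doc in docs]
vocab = sorted(set(word for c in counts for word in c))

# Rows = documents, columns = vocabulary words, cells = occurrences.
dtm = [[c[word] for word in vocab] for c in counts]

print(vocab)
for row in dtm:
    print(row)
```

A single `Counter` over the whole corpus corresponds to the column sums of such a matrix.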


# Getting the data

The quotes are obtained from the URL given above. Inspecting the page in the browser's inspector shows that each quote sits inside an `<li> </li>` element.
Because the raw text has some formatting issues, we also strip unnecessary characters.

In [1]:
import requests
from bs4 import BeautifulSoup
import pickle

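def url_to_transcript(url):
    """Return the list of quotes scraped from the blog post at `url`."""
    page = requests.get(url).text
    soup = BeautifulSoup(page, "lxml")
    # Each <ol> inside the post body holds a batch of quotes, one per <li>.
    text = [p.text for p in soup.find(class_="single-post-content lg:px-10").find_all('ol')]
    result = []
    for item in text:
        # One list item per line; skip the empty lines between them.
        for quote in item.split('\n'):
            if quote != '':
                # Drop the leading character and keep the text up to the first period.
                end = quote.find('.')
                if end > 0:
                    result.append(quote[1:end])
    return result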
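
The slicing step at the end of `url_to_transcript` can be tried on its own. The sketch below assumes a raw list item that starts with an opening curly quote and ends with an author attribution. That item format, the placeholder author, and the second sample string are assumptions made for illustration, not text taken from the page.

```python
# Assumed raw list item: opening curly quote, quote text, then the author.
raw_item = "“The secret of getting ahead is getting started.” – Author Name"

end = raw_item.find('.')   # index of the first period
quote = raw_item[1:end]    # skip the first character, stop before the period
print(quote)

# Hypothetical multi-sentence item: only the first sentence survives.
multi = "“First sentence. Second sentence.”"
first_only = multi[1:multi.find('.')]
print(first_only)
```

Cutting at the first period removes the attribution, but it also shortens any quote that contains more than one sentence.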

With the cell below we create the list with all the quotes.

In [2]:
url = 'https://www.oberlo.com/blog/motivational-quotes'

quotes = url_to_transcript(url)

We create a folder with all the quotes, one quote per file.

In [3]:
import os

# Create the output folder if needed; enumerate gives each quote a unique file index.
os.makedirs("quotes", exist_ok=True)
for i, quote in enumerate(quotes):
    with open("quotes/quote" + str(i) + ".txt", "wb") as file:
        pickle.dump(quote, file)

In [4]:
# Load pickled files
data = []
for i in range(len(quotes)):
    with open("quotes/quote" + str(i) + ".txt", "rb") as file:
        data.append(pickle.load(file))
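
When file names are derived from a quote's position in the list, `list.index` and `enumerate` behave differently if the same quote appears twice. The sketch below uses a hypothetical three-item sample (not the scraped data) to show the difference.

```python
# Hypothetical sample in which the same quote appears twice.
sample = ["dont limit yourself", "write it", "dont limit yourself"]

# list.index returns the position of the first match, so duplicates share a name.
names_by_index = ["quote" + str(sample.index(q)) + ".txt" for q in sample]

# enumerate yields each position exactly once, so every name is unique.
names_by_enumerate = ["quote" + str(i) + ".txt" for i, _ in enumerate(sample)]

print(names_by_index)
print(names_by_enumerate)
```

With position-based names from `enumerate`, a repeated quote gets its own file instead of overwriting an earlier one.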

You can check that the files contain what you expect.

In [5]:
for quote in data:
    print("-",quote)

- All our dreams can come true, if we have the courage to pursue them
- The secret of getting ahead is getting started
- I’ve missed more than 9,000 shots in my career
- Don’t limit yourself
- The best time to plant a tree was 20 years ago
- Only the paranoid survive
- It’s hard to beat a person who never gives up
- I wake up every morning and think to myself, ‘how far can I push this company in the next 24 hours
- If people are doubting how far you can go, go so far that you can’t hear them anymore
- “We need to accept that we won’t always make the right decisions, that we’ll screw up royally sometimes – understanding that failure is not the opposite of success, it’s part of success
- Write it
- You’ve gotta dance like there’s nobody watching, love like you’ll never be hurt, sing like there’s nobody listening, and live like it’s heaven on earth
- Fairy tales are more than true: not because they tell us that dragons exist, but because they tell us that dragons can be beaten
- Everythin

# Cleaning the data

Now it's time to put the text in lowercase and remove punctuation.

In [6]:
import re
import string

def clean_text(text):
    text = text.lower()
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub('[‘’“”…]', '', text)
    return text
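
To see what `clean_text` does to individual strings, the sketch below repeats the function so it runs on its own and applies it to two short samples. The sample strings are fragments of quotes printed earlier; the variable names are illustrative.

```python
import re
import string

def clean_text(text):
    text = text.lower()
    # Remove ASCII punctuation such as , . ! ?
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    # Remove curly quotes and the ellipsis character, which are not in string.punctuation.
    text = re.sub('[‘’“”…]', '', text)
    return text

cleaned_a = clean_text("I’ve missed more than 9,000 shots in my career")
cleaned_b = clean_text("sometimes – understanding")
print(cleaned_a)
print(cleaned_b)
```

The en dash in the second sample is left in place, because it is neither in `string.punctuation` nor in the second character class.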


In [7]:
cleanQuote = []
for quote in data:
    cleanQuote.append(clean_text(quote))
print(cleanQuote)

['all our dreams can come true if we have the courage to pursue them', 'the secret of getting ahead is getting started', 'ive missed more than 9000 shots in my career', 'dont limit yourself', 'the best time to plant a tree was 20 years ago', 'only the paranoid survive', 'its hard to beat a person who never gives up', 'i wake up every morning and think to myself how far can i push this company in the next 24 hours', 'if people are doubting how far you can go go so far that you cant hear them anymore', 'we need to accept that we wont always make the right decisions that well screw up royally sometimes – understanding that failure is not the opposite of success its part of success', 'write it', 'youve gotta dance like theres nobody watching love like youll never be hurt sing like theres nobody listening and live like its heaven on earth', 'fairy tales are more than true not because they tell us that dragons exist but because they tell us that dragons can be beaten', 'everything you can im

# Organizing the Data

Now that we have our corpus, we save it for later use.

In [8]:
filename = "corpus.pkl"
# The with block closes the file even if pickling fails.
with open(filename, 'wb') as outfile:
    pickle.dump(cleanQuote, outfile)

Finally, we count how often each word appears in the corpus, also for later use.

In [9]:
# The stop-word list must be downloaded once with nltk.download('stopwords').
import nltk
from collections import Counter
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

raw = ' '.join(cleanQuote)
tokenizer = RegexpTokenizer(r'[A-Za-z]{2,}')
words = tokenizer.tokenize(raw)

stop_words = set(stopwords.words('english'))
stop_words.add('dont')
stop_words.add('youre')
words = [word for word in words if word not in stop_words]

counter = Counter()
counter.update(words)
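
The counting step can be sketched without NLTK. The example below is an approximation under stated assumptions: `re.findall` stands in for `RegexpTokenizer` with the same pattern, and the small `stop_words` set is a hypothetical stand-in for NLTK's English stop-word list.

```python
import re
from collections import Counter

# Two cleaned quotes used as a tiny sample corpus.
clean_quotes = ["the secret of getting ahead is getting started", "dont limit yourself"]

# Hypothetical stand-in for the NLTK stop-word list.
stop_words = {"the", "of", "is", "dont"}

raw = ' '.join(clean_quotes)
# Keep alphabetic tokens of at least two letters, then drop stop words.
words = [w for w in re.findall(r'[A-Za-z]{2,}', raw) if w not in stop_words]

counter = Counter(words)
print(counter.most_common(3))
```

`Counter.most_common` is a quick way to inspect the most frequent words before saving the counts.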


In [10]:
filename = "data_clean.pkl"
with open(filename, 'wb') as outfile:
    pickle.dump(words, outfile)

In [11]:
filename = "dtm.pkl"
with open(filename, 'wb') as outfile:
    pickle.dump(counter, outfile)
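
A later notebook can read these files back with `pickle.load`. The sketch below checks the round trip for a `Counter`. It writes to a temporary directory and uses hypothetical sample counts, so it does not touch the files created above.

```python
import os
import pickle
import tempfile
from collections import Counter

# Hypothetical sample counts standing in for the real word counter.
counter = Counter({"getting": 2, "secret": 1})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dtm.pkl")
    with open(path, "wb") as outfile:
        pickle.dump(counter, outfile)
    with open(path, "rb") as infile:
        restored = pickle.load(infile)

# The restored object is still a Counter with the same contents.
print(restored == counter)
```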