# Data Cleaning
## Cleaning up text data scraped from the web

1. Extract data from the web
2. Clean the extracted data
3. Organize the data

### The output of this notebook will be:
1. Corpus - A collection of text
2. Document-term matrix - word counts in matrix format
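
To make the second output concrete, here is a minimal sketch of a document-term matrix built with only the standard library. The two-document corpus and the names `corpus`, `vocabulary`, and `dtm` are made up for illustration; the notebook itself builds the real matrix later with `CountVectorizer`.

```python
from collections import Counter

# Hypothetical two-document corpus, used only for illustration
corpus = {
    "doc_a": "data cleaning makes data useful",
    "doc_b": "cleaning text takes time",
}

# Vocabulary: every distinct word in the corpus, sorted for a stable column order
vocabulary = sorted({word for text in corpus.values() for word in text.split()})

# One row per document, one column per vocabulary word, each cell a word count
dtm = {}
for name, text in corpus.items():
    counts = Counter(text.split())
    dtm[name] = [counts[word] for word in vocabulary]

print(vocabulary)
for name, row in dtm.items():
    print(name, row)
```

Each row has the same length as the vocabulary, and words that never occur in a document get a count of 0.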



In [10]:
# Web scraping, pickle imports
import requests
from bs4 import BeautifulSoup
import pickle

# Scrapes transcript data from scrapsfromtheloft.com
def url_to_transcript(url):
    '''Returns transcript data specifically from scrapsfromtheloft.com.'''
    page = requests.get(url).text
    soup = BeautifulSoup(page, "lxml")
    text = [p.text for p in soup.find(class_="post-content").find_all('p')]
    print(url)
    return text

# URLs of transcripts in scope
urls = ['https://scrapsfromtheloft.com/2020/03/15/mark-zuckerberg-yuval-noah-harari-transcript/']

# Short name for each transcript, used as dictionary key and file name
comedians = ['mark']

In [11]:
# Request transcripts (takes a few minutes to run)
transcripts = [url_to_transcript(u) for u in urls]

https://scrapsfromtheloft.com/2020/03/15/mark-zuckerberg-yuval-noah-harari-transcript/


In [12]:
# Pickle files for later use

# Make a new directory to hold the pickled transcripts (no error if it already exists)
import os
os.makedirs("transcripts", exist_ok=True)

for i, c in enumerate(comedians):
    with open("transcripts/" + c + ".txt", "wb") as file:
        pickle.dump(transcripts[i], file)


In [13]:
# Load pickled files
data = {}
for i, c in enumerate(comedians):
    with open("transcripts/" + c + ".txt", "rb") as file:
        data[c] = pickle.load(file)
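
The save and load cells form a pickle round trip. A minimal sketch of the same pattern, run against a temporary directory so that nothing is left on disk, might look like this. The names `sample_transcript` and `sample.pkl` are placeholders and not part of the notebook.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for one scraped transcript (a list of paragraph strings)
sample_transcript = ["First paragraph.", "Second paragraph."]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.pkl")

    # Serialize the list to disk; pickle needs binary mode ("wb")
    with open(path, "wb") as file:
        pickle.dump(sample_transcript, file)

    # Read it back in binary mode ("rb"); the result is a new, equal list
    with open(path, "rb") as file:
        restored = pickle.load(file)

print(restored == sample_transcript)
```

Note that the notebook's files carry a `.txt` extension but hold pickle bytes, so they will not be readable in a text editor. Only unpickle files you created yourself or otherwise trust.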

In [14]:
# Double check to make sure data has been loaded properly
data.keys()

dict_keys(['mark'])

In [16]:
# More checks
data['mark'][:2]

['Mark Zuckerberg hosts Yuval Noah Harari for a conversation about some big challenges as part of the Facebook CEO’s 2019 series of public discussions about the future of technology in society. The overarching question they debate is: what are we going to do about the systemic problems of the current technological revolution?',
 'Mark Zuckerberg: Hey, everyone. This year I’m doing a series of public discussions on the future of the Internet and society and some of the big issues around that. And today I’m here with Yuval Noah Harari. A great historian and best-selling author of a number of books. His first book, Sapiens: A Brief History of Humankind, chronicled and did an analysis, going from the early days of hunter/gatherer society to now how our civilization is organized. And your next two books, the Homo Deus: A Brief History of Tomorrow and 21 Lessons for the 21st Century, actually tackle important issues of technology and the future. And that’s a lot of what we’ll talk about toda

## Cleaning Data
### Common data cleaning steps on text:
1. Lower case
2. Remove punctuation
3. Remove numbers
4. Remove common nonsensical text
5. Tokenize text
6. Remove stop words
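
Steps 5 and 6 are handled later in this notebook by `CountVectorizer`, so they never appear as explicit code. As a rough sketch of what they mean, the snippet below tokenizes on whitespace and filters against a tiny hand-picked stop-word set. `STOP_WORDS`, `tokenize`, and `remove_stop_words` are illustrative names, and the word list is an assumption that is far shorter than the list scikit-learn uses.

```python
# Tiny illustrative stop-word set (an assumption; real lists are much longer)
STOP_WORDS = {"a", "about", "and", "for", "is", "of", "the", "to"}

def tokenize(text):
    """Split already-cleaned text into word tokens on whitespace."""
    return text.split()

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word set."""
    return [token for token in tokens if token not in stop_words]

cleaned = "the future of technology and society"
tokens = tokenize(cleaned)
content_tokens = remove_stop_words(tokens)

print(tokens)
print(content_tokens)
```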

In [19]:
# Join the extracted paragraphs into one large string
def combine_text(list_of_text):
    '''Takes a list of strings and joins them into one string, separated by spaces.'''
    return ' '.join(list_of_text)

In [20]:
# Combine it!
data_combined = {key: [combine_text(value)] for (key, value) in data.items()}

In [21]:
# Put it into a pandas dataframe
import pandas as pd
pd.set_option('display.max_colwidth', 150)

data_df = pd.DataFrame.from_dict(data_combined).transpose()
data_df.columns = ['transcript']
data_df = data_df.sort_index()
data_df

| | transcript |
|---|---|
| mark | Mark Zuckerberg hosts Yuval Noah Harari for a conversation about some big challenges as part of the Facebook CEO’s 2019 series of public discussio... |


In [24]:
# Apply text cleaning techniques
import re
import string

def clean_text_round1(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation,
    remove words containing numbers, and strip curly quotes, ellipses and newlines.'''
    text = text.lower()
    # Raw strings keep the regex backslashes from being read as string escapes
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)

    # Get rid of some additional punctuation and nonsensical text that string.punctuation does not cover
    text = re.sub('[‘’“”…]', '', text)
    text = re.sub('\n', '', text)

    return text

# Alias used when applying the cleaning step to the dataframe
round1 = clean_text_round1
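
To see what each substitution contributes, the same regular expressions can be applied one at a time to a short string. This is only an illustrative trace: the `sample` sentence is invented, and the `step1` to `step6` names do not appear in the notebook.

```python
import re
import string

# Invented sample containing brackets, ASCII punctuation, digits, curly quotes and a newline
sample = "Mark: [laughs] It’s 2019, and I’ve read 21 books!\n"

step1 = sample.lower()                                            # lowercase everything
step2 = re.sub(r'\[.*?\]', '', step1)                             # drop text in square brackets
step3 = re.sub('[%s]' % re.escape(string.punctuation), '', step2) # drop ASCII punctuation
step4 = re.sub(r'\w*\d\w*', '', step3)                            # drop words containing digits
step5 = re.sub('[‘’“”…]', '', step4)                              # drop curly quotes and ellipses
step6 = re.sub('\n', '', step5)                                   # drop newlines

# Removed pieces leave extra spaces behind, so compare tokens rather than raw strings
print(step6.split())
```

The curly apostrophe survives until `step5` because `string.punctuation` only contains ASCII characters, which is why the function needs the second pass.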

In [25]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(data_df.transcript.apply(round1))
data_clean

| | transcript |
|---|---|
| mark | mark zuckerberg hosts yuval noah harari for a conversation about some big challenges as part of the facebook ceos series of public discussions ab... |


## Organizing Data

### Corpus - a collection of text

In [27]:
# Let's add the full name
full_names = ['Mark Zuckerberg']

data_df['full_name'] = full_names
data_df

| | transcript | full_name |
|---|---|---|
| mark | mark zuckerberg hosts yuval noah harari for a conversation about some big challenges as part of the facebook ceos series of public discussions ab... | Mark Zuckerberg |


In [28]:
# Pickle it for later use
data_df.to_pickle("corpus.pkl")

### Document-Term Matrix - Tokenizing text

In [29]:
# Create a document-term matrix using CountVectorizer, and exclude common English stop words
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')
data_cv = cv.fit_transform(data_clean.transcript)
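# get_feature_names_out() needs scikit-learn >= 1.0; the older get_feature_names() was removed in 1.2
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())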
data_dtm.index = data_clean.index
data_dtm

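| | ability | able | absolute | absolutely | abstract | abuse | abused | abuses | abusing | abut | ... | yes | youd | youll | youre | youtube | youve | yuval | zambia | zen | zuckerberg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mark | 5 | 16 | 1 | 1 | 1 | 5 | 1 | 1 | 2 | 1 | ... | 3 | 4 | 1 | 34 | 1 | 3 | 4 | 1 | 2 | 38 |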


In [30]:
# Pickle it for later use
data_dtm.to_pickle("dtm.pkl")

In [31]:
# Also pickle the cleaned data (before we put it in document-term matrix format) and the CountVectorizer object
data_clean.to_pickle('data_clean.pkl')
# Use a context manager so the file is closed once the dump finishes
with open("cv.pkl", "wb") as file:
    pickle.dump(cv, file)