# Data Cleaning + Sentiment Analysis

In this notebook, we will go through the basic data science pipeline: asking a research question, collecting data, cleaning the data, and then doing sentiment analysis.

**RQ:** In what ways are standup comedians different from each other?

**Data collection**: Where can we get data from? Luckily, https://scrapsfromtheloft.com has transcripts of standup comedy performances.

Here is one from my favorite comedian, Trevor Noah:
https://scrapsfromtheloft.com/2018/11/21/trevor-noah-son-of-patricia-transcript/

**Cleaning the data**: Last week's lab covered a bunch of data cleaning operations. Let's apply what we learned there to clean our transcript data. We will perform some popular text pre-processing techniques here.

**Organizing the data**: We will also organize the cleaned data in a way that is easy to feed into other algorithms for later analysis.

In [None]:
# Web scraping, pickle imports
import requests
from bs4 import BeautifulSoup
import pickle

# Scrapes transcript data from scrapsfromtheloft.com
def url_to_transcript(url):
    '''Returns transcript data specifically from scrapsfromtheloft.com.'''
    page = requests.get(url).text
    soup = BeautifulSoup(page, "lxml")
    text = [p.text for p in soup.find(class_="elementor-widget-theme-post-content").find_all('p')]
    print(url)
    return text

# URLs of transcripts in scope
urls = ['http://scrapsfromtheloft.com/2017/05/06/louis-ck-oh-my-god-full-transcript/',
        'http://scrapsfromtheloft.com/2017/04/11/dave-chappelle-age-spin-2017-full-transcript/',
        'http://scrapsfromtheloft.com/2018/03/15/ricky-gervais-humanity-transcript/',
        'http://scrapsfromtheloft.com/2017/08/07/bo-burnham-2013-full-transcript/',
        'http://scrapsfromtheloft.com/2017/05/24/bill-burr-im-sorry-feel-way-2014-full-transcript/',
        'http://scrapsfromtheloft.com/2017/04/21/jim-jefferies-bare-2014-full-transcript/',
        'http://scrapsfromtheloft.com/2017/08/02/john-mulaney-comeback-kid-2015-full-transcript/',
        'http://scrapsfromtheloft.com/2017/10/21/hasan-minhaj-homecoming-king-2017-full-transcript/',
        'http://scrapsfromtheloft.com/2017/09/19/ali-wong-baby-cobra-2016-full-transcript/',
        'http://scrapsfromtheloft.com/2017/08/03/anthony-jeselnik-thoughts-prayers-2015-full-transcript/',
        'http://scrapsfromtheloft.com/2018/03/03/mike-birbiglia-my-girlfriends-boyfriend-2013-full-transcript/',
        'http://scrapsfromtheloft.com/2017/08/19/joe-rogan-triggered-2016-full-transcript/']


# Comedian names
comedians = ['louis', 'dave', 'ricky', 'bo', 'bill', 'jim', 'john', 'hasan', 'ali', 'anthony', 'mike', 'joe']

In [None]:
# Actually request transcripts (takes a few minutes to run)
transcripts = [url_to_transcript(u) for u in urls]

In [None]:
transcripts

In [None]:
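# Pickle files for later use
import os

# Make a new directory to hold the pickled transcripts (no error if it already exists)
os.makedirs("transcripts", exist_ok=True)

# Note: the files use a .txt extension but contain binary pickle data
for i, c in enumerate(comedians):
    with open("transcripts/" + c + ".txt", "wb") as file:
        pickle.dump(transcripts[i], file)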

## Load the data

Here you will also see another nifty trick: [pickling objects](https://docs.python.org/3/library/pickle.html)

pickle lets you store an object in its current form and load it later (for example, in another notebook) to reuse it.
You can pickle a list, a dictionary, a data frame, and so on.

See: https://wiki.python.org/moin/UsingPickle

`pickle.dump` and `pickle.load` are the two most important pickling functions, and they will come in handy.
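
As a quick illustration of the round trip, here is a minimal sketch that dumps a toy list to a temporary file and loads it back. The list contents and the file name `example.pkl` are made up for this example; only unpickle files you created yourself or otherwise trust.

```python
import os
import pickle
import tempfile

# Toy object standing in for one scraped transcript (a list of paragraphs)
paragraphs = ["First paragraph.", "Second paragraph."]

# Work in a temporary directory so nothing in the project folder is touched
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.pkl")  # hypothetical file name

    with open(path, "wb") as f:   # pickle data is binary, hence "wb"
        pickle.dump(paragraphs, f)

    with open(path, "rb") as f:   # and "rb" when reading it back
        restored = pickle.load(f)

# The loaded object is an equal copy of the original, not the same object
print(restored == paragraphs)
```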

In [None]:
data = {}
for i, c in enumerate(comedians):
    with open("transcripts/" + c + ".txt", "rb") as file:
        data[c] = pickle.load(file)

In [None]:
# Double check to make sure data has been loaded properly
data.keys() # every key is a comedian and every value is that comedian's transcript

In [None]:
# More checks
print(len(data['louis']))
data['louis'][:1] # notice that the transcript is split into paragraphs and each paragraph is its own entry in the list

## Data cleaning

With numerical data, data cleaning often involves removing null values, duplicate data, outliers, etc. With text data, there are some common data cleaning techniques, which are also known as text pre-processing techniques.

With text data, this cleaning process can go on forever. There's always an exception to every cleaning step. We saw a glimpse of this in our last lab as well.

Let's start simple and iterate. We're going to execute just the common cleaning steps here and the rest can be done at a later point to improve our results.

**Common data cleaning steps on all text:**

* Make text all lower case
* Remove punctuation
* Remove numerical values
* Remove common nonsensical text (such as the newline character `\n`)
* Tokenize text
* Remove stop words

**When dealing with messy social media data, the list of steps can grow quickly, for example:**
* removing @ mentions from tweets
* removing # symbols from tweets
* or treating @ mentions and # hashtags as fixed token types
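
The two approaches for tweets can be sketched with regular expressions as follows. The tweet text and the placeholder tokens `<MENTION>` and `<HASHTAG>` are assumptions chosen for illustration, not part of our transcript data.

```python
import re

# Made-up tweet for illustration
tweet = "@trevornoah was hilarious tonight #standup #comedy"

# Option 1: drop @mentions entirely and keep hashtag words without the '#'
no_mentions = re.sub(r'@\w+', '', tweet)
no_hash_symbols = re.sub(r'#(\w+)', r'\1', no_mentions)
option1 = ' '.join(no_hash_symbols.split())  # tidy up leftover whitespace

# Option 2: replace every mention / hashtag with a fixed placeholder token
option2 = re.sub(r'@\w+', '<MENTION>', tweet)
option2 = re.sub(r'#\w+', '<HASHTAG>', option2)

print(option1)  # → was hilarious tonight standup comedy
print(option2)  # → <MENTION> was hilarious tonight <HASHTAG> <HASHTAG>
```

Which option is better depends on the question: dropping mentions keeps the vocabulary small, while placeholder tokens preserve the fact that a mention or hashtag occurred.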

**More data cleaning steps after tokenization:**

* Stemming / lemmatization
* Parts of speech tagging [*We will go over in next lab*]
* Create bi-grams or tri-grams
* Deal with typos
* And more...
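
To make the bi-gram and tri-gram idea concrete, here is a minimal sketch using only the standard library. It assumes the text has already been cleaned and uses a simple whitespace split as the tokenizer; the sample sentence is made up.

```python
from collections import Counter

# Made-up, already-cleaned text
text = "thank you thank you very much"
tokens = text.split()  # simple whitespace tokenization

# Pair each token with its successor(s) to form n-grams
bigrams = list(zip(tokens, tokens[1:]))
trigrams = list(zip(tokens, tokens[1:], tokens[2:]))

# Count how often each bi-gram occurs
bigram_counts = Counter(bigrams)
print(bigram_counts[("thank", "you")])  # → 2
```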

In [None]:
# Let's take a look at our data again
data['louis'][0]

In [None]:
data['louis'][1]

**Combining the data**

Notice that the transcript data is separated into paragraphs, and each paragraph is its own entry in the list.
Let's change this by combining all paragraphs into one transcript.

In [None]:
# We are going to change this to key: comedian, value: string format
def combine_text(list_of_text):
    '''Takes a list of text and combines them into one large chunk of text.'''
    combined_text = ' '.join(list_of_text)
    return combined_text

In [None]:
# Combine it!
data_combined = {key: [combine_text(value)] for (key, value) in data.items()}

In [None]:
#check to see
data_combined

#### pandas dataframe
We can either keep it in dictionary format or put it into a pandas dataframe

In [None]:
import pandas as pd
pd.set_option('max_colwidth',150)

data_df = pd.DataFrame.from_dict(data_combined).transpose() 
data_df.columns = ['transcript']
data_df = data_df.sort_index()
data_df

In [None]:
# see what you would have gotten if you didn't transpose
data_df2 = pd.DataFrame.from_dict(data_combined) 
#data_df.columns = ['transcript']
#data_df = data_df.sort_index()
data_df2

Let's take a look at the transcript for Hasan Minhaj. Key = hasan

<span class="mark">Can you fill</span> in the code below? Recall how to index into a pandas dataframe by key.

In [None]:
# Let's take a look at the transcript for Hasan Minhaj. Key = hasan

# Your code below



### cleaning - first round

Python regular expressions will be useful here.
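
Before applying the cleaning to whole transcripts, it can help to watch `re.sub` work on one sentence. The sample sentence below is made up; the patterns are the same kind used in this first cleaning round.

```python
import re
import string

# Made-up sentence containing brackets, punctuation, and numbers
sample = "[Audience laughs] Thank you! It's 2017, and I'm 50 years old."

lowered = sample.lower()
# Remove anything inside square brackets (non-greedy match)
no_brackets = re.sub(r'\[.*?\]', '', lowered)
# Remove every punctuation character listed in string.punctuation
no_punct = re.sub('[%s]' % re.escape(string.punctuation), '', no_brackets)
# Remove any word that contains a digit
no_numbers = re.sub(r'\w*\d\w*', '', no_punct)

print(no_numbers.split())  # → ['thank', 'you', 'its', 'and', 'im', 'years', 'old']
```

Note that removing punctuation turns "it's" into "its" and "i'm" into "im", which is one of the exceptions you may want to handle in a later round.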

In [None]:
import re, string

def clean_text_round1(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation and remove words containing numbers.'''
    text = text.lower() # convert all text to lowercase
    text = re.sub(r'\[.*?\]', '', text) # get rid of text in square brackets. See the usage of sub
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text) # get rid of punctuation marks
    text = re.sub(r'\w*\d\w*', '', text) # \d matches digits, \w matches word characters. Get rid of words containing numbers
    return text

round1 = lambda x: clean_text_round1(x)

In [None]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(data_df.transcript.apply(round1))
data_clean

In [None]:
# Apply a second round of cleaning
def clean_text_round2(text):
    '''Get rid of some additional punctuation and nonsensical text that was missed the first time around.'''
    text = re.sub('[‘’“”…]', '', text)
    text = re.sub('\n', ' ', text)  # replace newlines with a space so adjacent words don't merge
    return text

round2 = lambda x: clean_text_round2(x)

In [None]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(data_clean.transcript.apply(round2))
data_clean

**NOTE:** This data cleaning or text pre-processing step could go on for a while, but we can stop for now. After going through some analysis techniques, if you see that the results don't make sense or could be improved, you can come back and make more edits such as:

* Mark 'driving' and 'drive' as the same word (stemming / lemmatization)
* Combine 'thank you' into one term (bi-grams)
* And a lot more...

## Organizing the cleaned data

Let's organize and save the clean data into the following two standard text formats:

* Corpus - a collection of text
* Document-Term Matrix - word counts in matrix format

### Corpus

We already created a corpus in an earlier step. The definition of a corpus is a collection of texts, and they are all put together neatly in a pandas dataframe here.

In [None]:
# Let's take a look at our dataframe
data_df

In [None]:
full_names = ['Ali Wong', 'Anthony Jeselnik', 'Bill Burr', 'Bo Burnham', 'Dave Chappelle', 'Hasan Minhaj',
              'Jim Jefferies', 'Joe Rogan', 'John Mulaney', 'Louis C.K.', 'Mike Birbiglia', 'Ricky Gervais']

data_df['full_name'] = full_names
data_df

In the same way, you can keep adding more columns to your dataframe, for example characteristics of the comedians (which year they were born, when they first started standup, etc.).

In [None]:
# Let's pickle the corpus for later use
data_df.to_pickle("corpus.pkl")

### Document-Term Matrix

Once the text is tokenized, every row can represent a different document and every column a different word (or token).

scikit-learn's [`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) comes in handy here.

Beyond tokenizing, it can do a number of other things (removing stop words, including bigrams, ...). You can also press Shift+Tab inside a code cell to see its documentation.
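
To make the rows-as-documents, columns-as-words layout concrete, here is a hand-rolled sketch of a document-term matrix using only the standard library. It is not how `CountVectorizer` is implemented, and the two toy documents are made up; it only shows the shape of the result.

```python
from collections import Counter

# Two made-up documents
docs = {
    "doc_a": "the cat sat on the mat",
    "doc_b": "the dog sat",
}

# Count the words in each document (simple whitespace tokenization)
counts = {name: Counter(text.split()) for name, text in docs.items()}

# The vocabulary (the columns) is every distinct word across all documents
vocabulary = sorted(set().union(*counts.values()))

# One row per document, one count per vocabulary word (0 if the word is absent)
dtm = {name: [counts[name][word] for word in vocabulary] for name in docs}

print(vocabulary)    # → ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(dtm["doc_a"])  # → [1, 0, 1, 1, 1, 2]
print(dtm["doc_b"])  # → [0, 1, 0, 0, 1, 1]
```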

In [None]:
# We are going to create a document-term matrix using CountVectorizer, and exclude common English stop words
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english') # first instantiate a CountVectorizer object; stop words are also removed here
data_cv = cv.fit_transform(data_clean.transcript)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is its replacement
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())
data_dtm.index = data_clean.index
data_dtm

In [None]:
# Let's pickle the document term matrix for later use
data_dtm.to_pickle("dtm.pkl")

In [None]:
# Let's also pickle the cleaned data (before we put it in document-term matrix format) and the CountVectorizer object
data_clean.to_pickle('data_clean.pkl')
with open("cv.pkl", "wb") as file:  # the with-block closes the file once the dump is done
    pickle.dump(cv, file)

**TODOs for later**

* Further cleaning: add another regular expression to the clean_text_round2 function to clean the text further.
* Play around with CountVectorizer's parameters. What is ngram_range? What are min_df and max_df?

# Sentiment Analysis
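
We will use TextBlob, which reports a polarity score between -1 (negative) and 1 (positive) and a subjectivity score between 0 (factual) and 1 (opinionated). To build intuition for lexicon-based scoring, here is a toy sketch that averages per-word scores. The lexicon `TOY_LEXICON`, its values, and the function `toy_sentiment` are all made up for illustration; TextBlob's actual scoring is more involved.

```python
# Hypothetical mini-lexicon: word -> (polarity in [-1, 1], subjectivity in [0, 1])
TOY_LEXICON = {
    "great": (0.8, 0.75),
    "love": (0.5, 0.6),
    "terrible": (-1.0, 1.0),
    "boring": (-1.0, 1.0),
}

def toy_sentiment(text):
    '''Average the scores of all words found in the toy lexicon.'''
    scores = [TOY_LEXICON[w] for w in text.lower().split() if w in TOY_LEXICON]
    if not scores:
        return 0.0, 0.0  # no known sentiment words: treat as neutral
    polarity = sum(p for p, _ in scores) / len(scores)
    subjectivity = sum(s for _, s in scores) / len(scores)
    return polarity, subjectivity

print(toy_sentiment("the show was great"))
print(toy_sentiment("great but boring"))
```

Because the scores are averaged, a long routine that mixes positive and negative words tends to land near the middle of the polarity scale.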



In [None]:
# We will read the corpus, where the order of the words is preserved
data = pd.read_pickle('corpus.pkl')
data

In [None]:
# Create quick lambda functions to find the polarity and subjectivity of each routine
from textblob import TextBlob

pol = lambda x: TextBlob(x).sentiment.polarity
sub = lambda x: TextBlob(x).sentiment.subjectivity

data['polarity'] = data['transcript'].apply(pol)
data['subjectivity'] = data['transcript'].apply(sub)
data

In [None]:
# Let's plot the results
import matplotlib.pyplot as plt

#plt.rcParams['figure.figsize'] = [10, 8]

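for comedian in data.index:
    x = data.polarity.loc[comedian]
    y = data.subjectivity.loc[comedian]
    plt.scatter(x, y, color='blue')
    # Look up the name by index label; positional [] access on a labeled Series is deprecated in pandas
    plt.text(x+.001, y+.001, data['full_name'].loc[comedian], fontsize=10)

# Set the x-axis range once, after all points are drawn
plt.xlim(-.01, .12)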
    
plt.title('Sentiment Analysis', fontsize=20)
plt.xlabel('<-- Negative -------- Positive -->', fontsize=15)
plt.ylabel('<-- Facts -------- Opinions -->', fontsize=15)

plt.show()

**What can you infer?**