# **A text preprocessing pipeline from scratch**

### **Includes:**
* Getting data (scraping)
* Lowercasing
* Removing HTML tags
* Removing URLs
* Removing punctuation
* Stop word removal
* Emoji handling
* Tokenization
* Stemming
* Lemmatization

### ***Let's Get Started:***

## Scraping Data

In [2]:
import requests
from bs4 import BeautifulSoup
import pickle

In [3]:
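def url_to_transcript(url):
  """Return the text of every <p> inside the page's 'ast-container' element."""
  # download the page; fail early on HTTP errors instead of parsing an error page
  response = requests.get(url, timeout=30)
  response.raise_for_status()
  # parse the HTML with the lxml parser
  soup = BeautifulSoup(response.text, 'lxml')
  # collect the text of each paragraph in the main content container
  text = [p.text for p in soup.find(class_='ast-container').find_all('p')]
  print("Extracted from URL:", url)

  return text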

In [4]:
# getting transcripts of oppenheimer and barbie movies from scrapsfromtheloft.com

urls = ['https://scrapsfromtheloft.com/movies/oppenheimer-2023-transcript/',
        'https://scrapsfromtheloft.com/movies/barbie-2023-transcript/']

names = ['oppenheimer','barbie']

In [5]:
transcripts = [url_to_transcript(url) for url in urls]

Extracted from URL: https://scrapsfromtheloft.com/movies/oppenheimer-2023-transcript/
Extracted from URL: https://scrapsfromtheloft.com/movies/barbie-2023-transcript/


In [6]:
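# pickle for later use
import os

# create the output folder (no error if it already exists, unlike !mkdir)
os.makedirs("transcripts", exist_ok=True)

# note: the files hold binary pickle data despite the .txt extension
for i,movie in enumerate(names):
  with open("transcripts/" + movie +'.txt', 'wb') as file:
    pickle.dump(transcripts[i], file)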

In [7]:
## Load pickled files

data = {}
for m in names:
  with open('transcripts/'+m+'.txt', 'rb') as file:
    data[m] = pickle.load(file)

In [8]:
data.keys()

dict_keys(['oppenheimer', 'barbie'])

In [9]:
data['barbie'][:25]

['Stereotypical\xa0Barbie (“Barbie”) and fellow dolls reside in Barbieland; a\xa0matriarchal\xa0society with different variations of Barbies, Kens, and a group of discontinued models, who are treated like outcasts due to their unconventional traits. While the Kens spend their days playing at the beach, considering it as their profession, the Barbies hold prestigious jobs such as doctors, lawyers, and politicians. Beach Ken (“Ken”) is only happy when he is with Barbie and seeks a closer relationship, but Barbie rebuffs him in favor of other activities and female friendships.',
 '* * *',
 '[narrator] Since the beginning of time, since the first little girl ever existed, there have been dolls.',
 'But the dolls were always and forever baby dolls.',
 'The girls who played with them could only ever play at being mothers.',
 'Which can be fun, at least for a while, anyway.',
 'Ask your mother.',
 'This continued until…',
 '[“Thus Spoke Zarathustra” plays]',
 '[ting]',
 '[narrator] Yes, Barbi

## Data Cleaning

In [10]:
# each transcript is a list of paragraph strings
# let's join them into a single string per movie

def combine_texts(text_list):
  combined_text = ' '.join(text_list)
  return combined_text

In [11]:
combined_data = {key:[combine_texts(value)] for (key,value) in data.items()}

In [12]:
combined_data['barbie']



In [13]:
## Making a dataframe

In [14]:
import pandas as pd

pd.set_option('max_colwidth',150)

data_df = pd.DataFrame.from_dict(combined_data).transpose()
data_df.columns = ['transcript']
data_df.head()

| | transcript |
|---|---|
| oppenheimer | Oppenheimer is a 2023 biographical thriller film directed by Christopher Nolan, starring Cillian Murphy as J. Robert Oppenheimer, the physicist kn... |
| barbie | Stereotypical Barbie (“Barbie”) and fellow dolls reside in Barbieland; a matriarchal society with different variations of Barbies, Kens, and a gro... |


In [15]:
data_df = data_df.sort_index()
data_df.head()

| | transcript |
|---|---|
| barbie | Stereotypical Barbie (“Barbie”) and fellow dolls reside in Barbieland; a matriarchal society with different variations of Barbies, Kens, and a gro... |
| oppenheimer | Oppenheimer is a 2023 biographical thriller film directed by Christopher Nolan, starring Cillian Murphy as J. Robert Oppenheimer, the physicist kn... |


In [16]:
# Transcript of Barbie
data_df.transcript.loc['barbie']



**Lowercasing**

In [17]:
def lower_data(text):
  return text.lower()

**Removing punctuation, bracketed text, and words containing digits**

In [18]:
import re
import string

def clean_data(text):
  # Remove text in square brackets (e.g. "[narrator]"), ASCII punctuation,
  # and any word that contains a digit
  text = re.sub(r'\[.*?\]', '', text)
  text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
  text = re.sub(r'\w*\d\w*', '', text)

  # Remove curly quotes, ellipses and newlines
  text = re.sub('[‘’“”…]', '', text)
  text = re.sub('\n', '', text)
  return text
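
To see what each substitution in `clean_data` does, here is a small walkthrough that applies the same five patterns one at a time. The sample string is made up for illustration (it is not taken from the transcripts), and the step labels are just descriptive names.

```python
import re
import string

# Hypothetical sample line, written to exercise every pattern
sample = "[narrator] yes, you’d think barbie changed everything in 1959…\n“wow!”"

# (label, pattern) pairs in the same order clean_data applies them
steps = [
    ("bracketed text", r'\[.*?\]'),
    ("ASCII punctuation", '[%s]' % re.escape(string.punctuation)),
    ("words with digits", r'\w*\d\w*'),
    ("curly quotes and ellipsis", '[‘’“”…]'),
    ("newlines", '\n'),
]

text = sample
for label, pattern in steps:
    text = re.sub(pattern, '', text)
    print(f"after removing {label}: {text!r}")

# Splitting on whitespace shows the words that survive
tokens = text.split()
print(tokens)
```

Deleting the apostrophe turns "you’d" into "youd", which is why tokens such as `youd` and `youre` appear later in the vocabulary.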

In [19]:
def remove_emoji(text):
  # Character class built from common emoji code-point ranges
  emoji_pattern = re.compile("["
                           "\U0001F600-\U0001F64F"  # emoticons
                           "\U0001F300-\U0001F5FF"  # symbols & pictographs
                           "\U0001F680-\U0001F6FF"  # transport & map symbols
                           "\U0001F700-\U0001F77F"  # alchemical symbols
                           "\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
                           "\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
                           "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
                           "\U0001FA00-\U0001FA6F"  # Chess Symbols
                           "\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
                           "\U00002600-\U000026FF"  # Miscellaneous Symbols
                           "\U00002702-\U000027B0"  # Dingbats
                           "\U0001F1E6-\U0001F1FF"  # regional indicators (flags)
                           "\U0001F200-\U0001F251"  # enclosed ideographic supplement
                           "\U000024C2"             # circled M
                           "]+")
  # The single range \U000024C2-\U0001F251 was split up above because it
  # also matched CJK and many other non-emoji characters
  return emoji_pattern.sub('', text)


In [20]:
cleaner = lambda x: remove_emoji(clean_data(lower_data(x)))

In [29]:
cleaned_data = pd.DataFrame(data_df.transcript.apply(cleaner))
cleaned_data

| | transcript |
|---|---|
| barbie | stereotypical barbie barbie and fellow dolls reside in barbieland a matriarchal society with different variations of barbies kens and a group of d... |
| oppenheimer | oppenheimer is a biographical thriller film directed by christopher nolan starring cillian murphy as j robert oppenheimer the physicist known as ... |


In [30]:
cleaned_data.transcript['barbie']

'stereotypical\xa0barbie barbie and fellow dolls reside in barbieland a\xa0matriarchal\xa0society with different variations of barbies kens and a group of discontinued models who are treated like outcasts due to their unconventional traits while the kens spend their days playing at the beach considering it as their profession the barbies hold prestigious jobs such as doctors lawyers and politicians beach ken ken is only happy when he is with barbie and seeks a closer relationship but barbie rebuffs him in favor of other activities and female friendships     since the beginning of time since the first little girl ever existed there have been dolls but the dolls were always and forever baby dolls the girls who played with them could only ever play at being mothers which can be fun at least for a while anyway ask your mother this continued until     yes barbie changed everything then she changed it all again all of these women are barbie and barbie is all of these women she might have sta

The cleaned text still contains `\xa0`. This is how Python displays U+00A0, the Unicode non-breaking space, written as a hexadecimal escape. It looks like a regular space when rendered, but it is a different character, so we replace it with a plain space.
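
A small check of how the non-breaking space behaves, using only the standard library. The sample string is made up for illustration, and `unicodedata.normalize` with the `NFKC` form is shown as one possible alternative to a plain `str.replace`, not as the method used in this notebook.

```python
import unicodedata

# Hypothetical sample containing a non-breaking space
sample = "stereotypical\xa0barbie"

# NBSP is a distinct character from the ASCII space
print('\xa0' == ' ')             # False
print(unicodedata.name('\xa0'))  # NO-BREAK SPACE

# Option 1: targeted replacement
replaced = sample.replace('\xa0', ' ')

# Option 2: NFKC compatibility normalization maps NBSP to a regular space
normalized = unicodedata.normalize('NFKC', sample)

print(replaced == normalized)    # True
```

NFKC also rewrites other compatibility characters (for example, it expands the ellipsis `…` into three dots), so the targeted replace is the more conservative choice when only `\xa0` is a problem.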



In [31]:
# removing non-breaking space in Unicode
def remove_unicode_space(text):
  text = text.replace('\xa0', ' ')
  return text

cleaner2 = lambda x: remove_unicode_space(x)

cleaned_data = pd.DataFrame(cleaned_data.transcript.apply(cleaner2))
cleaned_data.transcript['barbie']

'stereotypical barbie barbie and fellow dolls reside in barbieland a matriarchal society with different variations of barbies kens and a group of discontinued models who are treated like outcasts due to their unconventional traits while the kens spend their days playing at the beach considering it as their profession the barbies hold prestigious jobs such as doctors lawyers and politicians beach ken ken is only happy when he is with barbie and seeks a closer relationship but barbie rebuffs him in favor of other activities and female friendships     since the beginning of time since the first little girl ever existed there have been dolls but the dolls were always and forever baby dolls the girls who played with them could only ever play at being mothers which can be fun at least for a while anyway ask your mother this continued until     yes barbie changed everything then she changed it all again all of these women are barbie and barbie is all of these women she might have started out 

In [32]:
cleaned_data

| | transcript |
|---|---|
| barbie | stereotypical barbie barbie and fellow dolls reside in barbieland a matriarchal society with different variations of barbies kens and a group of d... |
| oppenheimer | oppenheimer is a biographical thriller film directed by christopher nolan starring cillian murphy as j robert oppenheimer the physicist known as ... |


### ***Niceeeee!! Our data looks clean***

## ***Data Organization***

We can organize our data in various forms.
* Corpus: a collection of texts. Here, the dataframe with one transcript per row.
* Document-Term Matrix (DTM): a matrix with one row per document and one column per term, where each cell holds the number of times that term occurs in that document.
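
To make the document-term matrix concrete, here is a minimal hand-rolled sketch using only the standard library. The two toy documents and the tiny stop-word set are assumptions for illustration; the notebook itself uses `CountVectorizer` with scikit-learn's built-in English stop-word list, which is much larger.

```python
from collections import Counter

# Toy corpus (hypothetical), one already-cleaned string per document
corpus = {
    "barbie": "barbie is all of these women and barbie is a doll",
    "oppenheimer": "the physicist is known as the father of the bomb",
}

# Tiny illustrative stop-word set, not scikit-learn's list
stop_words = {"is", "all", "of", "these", "and", "a", "the", "as"}

# Count the remaining words in each document
counts = {
    name: Counter(w for w in text.split() if w not in stop_words)
    for name, text in corpus.items()
}

# Columns: sorted vocabulary across all documents
vocabulary = sorted(set().union(*counts.values()))

# Rows: one list of counts per document, aligned with the vocabulary
dtm = {name: [c[term] for term in vocabulary] for name, c in counts.items()}

print(vocabulary)
for name, row in dtm.items():
    print(name, row)
```

Each row has the same length as the vocabulary, and a zero means the term never occurs in that document.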

In [33]:
# cleaned_data is in corpus form already so leaving it as it is

In [35]:
# We are going to create a document-term matrix using CountVectorizer, and exclude common English stop words
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')
data_cv = cv.fit_transform(cleaned_data.transcript)
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())
data_dtm.index = cleaned_data.index
data_dtm

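| | aaron | ability | able | abomb | abort | absolutely | abstract | absurd | abundant | academia | ... | york | youd | youll | young | youngstown | youre | youve | zack | zero | zurich |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| barbie | 16 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 3 | 10 | 1 | 0 | 74 | 8 | 1 | 0 | 0 |
| oppenheimer | 0 | 1 | 3 | 4 | 1 | 3 | 1 | 2 | 2 | 1 | ... | 3 | 5 | 9 | 1 | 1 | 64 | 9 | 0 | 4 | 1 |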




There are a total of 3857 columns, one for each distinct word across both transcripts, excluding stop words.