# Data Cleaning

Data cleaning is time-consuming and rarely enjoyable, but it is essential. Keep in mind: "garbage in, garbage out". Feeding dirty data into a model produces meaningless results.

### Objective:

1. Getting the data
2. Cleaning the data
3. Organizing the data - arrange the cleaned data in a form that is easy to feed into other algorithms

### Output:

Cleaned and organized data in two standard text formats:

1. Corpus - a collection of text
2. Document-Term Matrix - word counts in matrix format

## Problem Statement

Compare transcripts from several TV series, note their similarities and differences, and determine whether a series of your choice differs from the others.


## Getting The Data

The TV series transcripts are taken from [Springfield!](https://www.springfieldspringfield.co.uk/).

In [5]:
# Web scraping, pickle imports
import requests
from bs4 import BeautifulSoup
import pickle

# Scrapes transcript data from springfieldspringfield.co.uk
def url_to_transcript(url):
    '''Returns transcript text from a springfieldspringfield.co.uk episode page.'''
    page = requests.get(url).text  # request the URL and get the HTML content as text
    soup = BeautifulSoup(page, "lxml")
    # Collect the text of every element whose class is "scrolling-script-container"
    text = [p.text for p in soup.find_all(class_="scrolling-script-container")]
    print(url)
    return text
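
To see what the class-based lookup in `url_to_transcript` returns without touching the network, the sketch below mimics that step on an inline HTML snippet. It swaps BeautifulSoup for the standard library's `html.parser` so that it runs with no extra installs; the `ClassTextExtractor` class and the sample markup are illustrative assumptions, not the site's real page structure.

```python
from html.parser import HTMLParser

class ClassTextExtractor(HTMLParser):
    """Hypothetical helper: collect the text inside every element with a given class."""
    VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}  # tags with no closing tag

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0     # > 0 while we are inside a matching element
        self.chunks = []   # one list of text pieces per matching element

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0:
            self.depth += 1            # nested element inside a match
        elif self.css_class in classes:
            self.depth = 1             # entering a new matching element
            self.chunks.append([])

    def handle_endtag(self, tag):
        if tag in self.VOID_TAGS:
            return
        if self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks[-1].append(data)

# Made-up markup standing in for an episode page
sample_html = """
<div class="nav">Menu</div>
<div class="scrolling-script-container">Line one.<br>Line two.</div>
"""

parser = ClassTextExtractor("scrolling-script-container")
parser.feed(sample_html)
parser.close()

text = ["".join(pieces) for pieces in parser.chunks]
print(text)  # ['Line one.Line two.']
```

As in the scraper, the result is a list with one string per matching element, which is why each transcript is stored as a list of text.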

In [6]:
# URLs of transcripts in scope
urls = ['https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=game-of-thrones&episode=s05e09',
        'https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=curb-your-enthusiasm&episode=s12e03',
        'https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=last-week-tonight-with-john-oliver-2014&episode=s11e01',
        'https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=resident-alien-2021&episode=s03e02',
        'https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=the-good-doctor-2017&episode=s07e01',
        'https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=death-and-other-details-2024&episode=s01e07',
        'https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=avatar-the-last-airbender-2024&episode=s01e01',
        'https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=peaky-blinders-2013&episode=s04e06',
        'https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=euphoria-2019&episode=s01e01',
        'https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=the-boys-2019&episode=s04e06'
       ]

# TV Series names
TV_Shows = ['Game Of Thrones', 'Curb Your Enthusiasm', 'Last Week Tonight with John Oliver', 'Resident Alien', 'The Good Doctor', 'Death and Other Details', 'Avatar: The Last Airbender', 'Peaky Blinders', 'Euphoria', 'The Boys']

In [7]:
# Actually request transcripts (takes a few minutes to run)
transcripts = [url_to_transcript(u) for u in urls]

https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=game-of-thrones&episode=s05e09
https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=curb-your-enthusiasm&episode=s12e03
https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=last-week-tonight-with-john-oliver-2014&episode=s11e01
https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=resident-alien-2021&episode=s03e02
https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=the-good-doctor-2017&episode=s07e01
https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=death-and-other-details-2024&episode=s01e07
https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=avatar-the-last-airbender-2024&episode=s01e01
https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=peaky-blinders-2013&episode=s04e06
https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=euphoria-2019&ep

In [8]:
# Pickle files for later use

# Make a new directory to hold the pickled transcripts (-p: no error if it already exists)
!mkdir -p transcripts

for i, c in enumerate(TV_Shows):
    with open("transcripts/" + c + ".txt", "wb") as file:
        pickle.dump(transcripts[i], file)


In [9]:
# Load pickled files
data = {}
for i, c in enumerate(TV_Shows):
    with open("transcripts/" + c + ".txt", "rb") as file:
        data[c] = pickle.load(file)
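
One thing to watch in the save and load cells: a show title is used directly as a file name, and 'Avatar: The Last Airbender' contains a colon, which Windows does not allow in file names. The sketch below shows one possible workaround on stand-in data. The `safe_filename` helper, the `.pkl` extension, and the temporary directory are assumptions made for illustration.

```python
import pickle
import re
import tempfile
from pathlib import Path

def safe_filename(title):
    """Hypothetical helper: replace characters that Windows forbids in file names."""
    return re.sub(r'[<>:"/\\|?*]', "_", title)

shows = ["Avatar: The Last Airbender", "The Boys"]
fake_transcripts = [["first chunk of text"], ["another chunk"]]  # stand-in data

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "transcripts"
    out_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists

    # Save each transcript under a sanitized file name
    for show, transcript in zip(shows, fake_transcripts):
        with open(out_dir / (safe_filename(show) + ".pkl"), "wb") as f:
            pickle.dump(transcript, f)

    # Load them back, keyed by the original (unsanitized) title
    loaded = {}
    for show in shows:
        with open(out_dir / (safe_filename(show) + ".pkl"), "rb") as f:
            loaded[show] = pickle.load(f)

print(safe_filename("Avatar: The Last Airbender"))  # Avatar_ The Last Airbender
print(loaded["The Boys"])                           # ['another chunk']
```

The dictionary keys keep the readable titles, so only the names on disk change.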

In [10]:
# Double check to make sure data has been loaded properly
data.keys()

dict_keys(['Game Of Thrones', 'Curb Your Enthusiasm', 'Last Week Tonight with John Oliver', 'Resident Alien', 'The Good Doctor', 'Death and Other Details', 'Avatar: The Last Airbender', 'Peaky Blinders', 'Euphoria', 'The Boys'])

In [11]:
# More checks
data['Curb Your Enthusiasm'][:2]

['\r\n\r\n\r\n                    \t\t\t- That\'s a long flight, man.\n\t\t\t- Long flight. Yeah.\n\t\t\tBrutal.\n\t\t\tHey, Larry, I see you.\n\t\t\tKeep fighting for us.\n\t\t\tOkay. Huh? How about that?\n\t\t\tMy people showing you\n\t\t\tfucking love, Larry.\n\t\t\tLarry David.\n\t\t\tOh my God, it\'s Larry David.\n\t\t\tSorry, I\'m Sienna Miller.\n\t\t\tI can\'t believe I\'m seeing you. I\'ve\n\t\t\tjust been watching you on the news.\n\t\t\t- How crazy is this, right?\n\t\t\t- It\'s crazy.\n\t\t\tWell, thank you, on behalf\n\t\t\tof everybody who has a heart.\n\t\t\tHey, Larry, good job!\n\t\t\tAnyway, I cannot believe\n\t\t\tthat I\'ve bumped into you.\n\t\t\tIt\'s bashert.\n\t\t\t- Where\'d you get that from?\n\t\t\t- Pretty good, huh?\n\t\t\tYeah, pretty good.\n\t\t\tBye, Larry.\n\t\t\tKeep in touch.\n\t\t\tKeep in touch? You saw that?\n\t\t\tSienna Miller, she was flirting\n\t\t\twith me, was she not?\n\t\t\tSaw that shit.\n\t\t\tThis is a once in a lifetime\n\t\t\topportunit

## Cleaning The Data

When dealing with numerical data, data cleaning often involves removing null values and duplicate data, dealing with outliers, etc. With text data, there are some common data cleaning techniques, which are also known as text pre-processing techniques.

With text data, this cleaning process can go on forever. There's always an exception to every cleaning step. So, we're going to follow the MVP (minimum viable product) approach - start simple and iterate.

Perform the following data cleaning on the transcripts:

1. Make text all lower case
2. Remove punctuation
3. Remove numerical values
4. Remove common nonsensical text (`\n`)
5. Tokenize text
6. Remove stop words
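
The steps listed above can be sketched end to end on a single line of text using only the standard library. This is a minimal illustration, not the pipeline used later in the notebook: the `clean_and_tokenize` function and the tiny `STOP_WORDS` set are assumptions, and real stop-word lists are much longer.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption; real lists are much longer)
STOP_WORDS = {"a", "the", "is", "to", "and", "of", "in", "i"}

def clean_and_tokenize(text):
    text = text.lower()                                               # lower case
    text = re.sub(r"[%s]" % re.escape(string.punctuation), "", text)  # remove punctuation
    text = re.sub(r"\w*\d\w*", "", text)                              # remove tokens containing digits
    text = re.sub(r"\s+", " ", text).strip()                          # newlines, tabs, extra spaces
    tokens = text.split()                                             # tokenize on whitespace
    return [t for t in tokens if t not in STOP_WORDS]                 # remove stop words

sample = "The Iron Bank has called in\n\t\t\tthe crown's 2 debts!"
print(clean_and_tokenize(sample))
# ['iron', 'bank', 'has', 'called', 'crowns', 'debts']
```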

In [12]:
# Let's take a look at our data again
next(iter(data.keys()))

'Game Of Thrones'

In [13]:
# Notice that our dictionary is currently in key: TV Show, value: list of text format
next(iter(data.values()))

['\r\n\r\n\r\n                    \t\t\tA girl will return to the docks. She will watch the gambler. And then what? A gift for the thin man. Why should I spend my time listening to you? You have no one at your side who understands the land you want to rule. What would you have me do with him? A ruler who kills those devoted to her is not a ruler who inspires devotion. Remove Ser Jorah from the city. You said whoever wins will fight at the Great Pit in front of the queen. Let me fight for her. The Iron Bank has called in the crown\'s debts. We must send an envoy to show these bankers our respect. Ser Meryn will lead your escort. There are only two like it in the world. The one I\'m wearing, the one I gave to Myrcella. Let me send her to Cersei one finger at a time. Dorne\'s too dangerous for you. I\'ve come to take you home. I love Trystane, I\'m going to marry him, and we\'re staying right here. This is about survival. The Night\'s Watch will let you through the tunnel. If you swear yo

In [14]:
# We are going to change this to key: TV Show, value: string format
def combine_text(list_of_text):
    '''Takes a list of text and combines them into one large chunk of text.'''
    combined_text = ' '.join(list_of_text)
    return combined_text

In [15]:
# Combine it!
data_combined = {key: [combine_text(value)] for (key, value) in data.items()}

In [16]:
# We can either keep it in dictionary format or put it into a pandas dataframe
import pandas as pd
pd.set_option('max_colwidth',150)

data_df = pd.DataFrame.from_dict(data_combined).transpose()
data_df.columns = ['transcript']
data_df = data_df.sort_index()
data_df

Unnamed: 0,transcript
Avatar: The Last Airbender,\r\n\r\n\r\n \t\t\t[mysterious music playing]\n\t\t\t[bell clanging]\n\t\t\t[action music playing]\n\t\t\t[guard] There he is! ...
Curb Your Enthusiasm,"\r\n\r\n\r\n \t\t\t- That's a long flight, man.\n\t\t\t- Long flight. Yeah.\n\t\t\tBrutal.\n\t\t\tHey, Larry, I see you.\n\t\t\..."
Death and Other Details,"\r\n\r\n\r\n \t\t\t﻿1\n\t\t\tKEITH TRUBITSKY aka DANNY TURNER:\n\t\t\tPreviously, on Death and Other Details\n\t\t\t[WHISTLES] ..."
Euphoria,"\r\n\r\n\r\n \t\t\t1 - (HEART BEATING) - (FLUID WHOOSHING) RUE: I was once happy, content, sloshing around in my own private,..."
Game Of Thrones,\r\n\r\n\r\n \t\t\tA girl will return to the docks. She will watch the gambler. And then what? A gift for the thin man. Why sho...
Last Week Tonight with John Oliver,"\r\n\r\n\r\n \t\t\t﻿1\n\t\t\tWelcome to ""Last Week Tonight!""\n\t\t\tI'm John Oliver!\n\t\t\tThank you so much for joining us.\n..."
Peaky Blinders,\r\n\r\n\r\n \t\t\tI just got served a black hand. Everybody will have got one. They're coming for us all. These men will not...
Resident Alien,"\r\n\r\n\r\n \t\t\t﻿1\n\t\t\tPreviously on ""Resident Alien""\n\t\t\tThere are Grey-human\n\t\t\thybrids all over the Earth.\n\t\..."
The Boys,\r\n\r\n\r\n \t\t\t﻿1\n\t\t\t[BUTCHER] The answer to all our prayers.\n\t\t\tA virus that kills Supes.\n\t\t\t[FRENCHIE] Somebo...
The Good Doctor,"\r\n\r\n\r\n \t\t\t﻿1\n\t\t\tPreviously on ""The Good Doctor""\n\t\t\t- Hey! Stop!\n\t\t\t- Danny!\n\t\t\t[TIRES SCREECH]\n\t\t\t..."


In [17]:
# Let's take a look at the transcript for Peaky Blinders
data_df.transcript.loc['Peaky Blinders']

'\r\n\r\n\r\n                    \t\t\tI just got served a black hand.  Everybody will have got one. They\'re coming for us all.  These men will not leave our city until our whole family is dead.  This was all my fault. It was me that shot the old man. John\'s dead because of me.  Within a four-mile radius of the Garrison, every man is a guard and a soldier for us.  We\'re going back, back to Small Heath. That\'s how it works.  An eye for an eye. It\'s called a vendetta.  We are an organization  run different, I imagine.  None of you will survive.  And how is Tommy Shelby OBE going to stop a revolution?  If I get Jessie Eden\'s trust,  she gives me the names of the instigators, and I give them up to the Crown forces.  I want you to help my son achieve his ambition.  You\'re a Peaky Blinder now, son.  We control him on the book, control the odds.  We\'ve got a lot going on though, Tom.  That\'s why it\'ll be good to have the kid around.  Mr Shelby will give you a 20% cut  if you put Gol

In [18]:
# Apply a first round of text cleaning techniques
import re
import string

def clean_text_round1(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation and remove words containing numbers.'''
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)  # drop bracketed cues such as [music playing]
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)  # delete punctuation
    text = re.sub(r'\w*\d\w*', '', text)  # delete tokens that contain a digit
    return text

round1 = lambda x: clean_text_round1(x)
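
Because round 1 deletes punctuation outright, hyphenated words get fused together: 'Grey-human' in the Resident Alien transcript becomes 'greyhuman' in the cleaned output. The sketch below contrasts that behaviour with one possible variant that replaces punctuation with a space. Both function names are made up for this comparison, and the variant is an assumption rather than the approach this notebook takes.

```python
import re
import string

PUNCT = "[%s]" % re.escape(string.punctuation)

def strip_punct_delete(text):
    """Same idea as round 1: punctuation is deleted outright."""
    return re.sub(PUNCT, "", text.lower())

def strip_punct_to_space(text):
    """Variant: drop apostrophes so contractions stay whole, turn other punctuation into a space."""
    text = text.lower().replace("'", "")
    text = re.sub(PUNCT, " ", text)
    return re.sub(r"\s+", " ", text).strip()

line = "There are Grey-human hybrids. It's true!"
print(strip_punct_delete(line))    # there are greyhuman hybrids its true
print(strip_punct_to_space(line))  # there are grey human hybrids its true
```

Which behaviour is preferable depends on whether hyphenated compounds should count as one term or two in the later word counts.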

In [19]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(data_df.transcript.apply(round1))
data_clean

Unnamed: 0,transcript
Avatar: The Last Airbender,\r\n\r\n\r\n \t\t\t\n\t\t\t\n\t\t\t\n\t\t\t there he is cut him off\n\t\t\t stop in the name of\n\t\t\t right there\n\t\t\t\n\t...
Curb Your Enthusiasm,\r\n\r\n\r\n \t\t\t thats a long flight man\n\t\t\t long flight yeah\n\t\t\tbrutal\n\t\t\they larry i see you\n\t\t\tkeep fight...
Death and Other Details,\r\n\r\n\r\n \t\t\t﻿\n\t\t\tkeith trubitsky aka danny turner\n\t\t\tpreviously on death and other details\n\t\t\t im rufus must...
Euphoria,\r\n\r\n\r\n \t\t\t heart beating fluid whooshing rue i was once happy content sloshing around in my own private primordial...
Game Of Thrones,\r\n\r\n\r\n \t\t\ta girl will return to the docks she will watch the gambler and then what a gift for the thin man why should ...
Last Week Tonight with John Oliver,\r\n\r\n\r\n \t\t\t﻿\n\t\t\twelcome to last week tonight\n\t\t\tim john oliver\n\t\t\tthank you so much for joining us\n\t\t\tw...
Peaky Blinders,\r\n\r\n\r\n \t\t\ti just got served a black hand everybody will have got one theyre coming for us all these men will not lea...
Resident Alien,\r\n\r\n\r\n \t\t\t﻿\n\t\t\tpreviously on resident alien\n\t\t\tthere are greyhuman\n\t\t\thybrids all over the earth\n\t\t\tif...
The Boys,\r\n\r\n\r\n \t\t\t﻿\n\t\t\t the answer to all our prayers\n\t\t\ta virus that kills supes\n\t\t\t somebodys\n\t\t\tbeen runnin...
The Good Doctor,\r\n\r\n\r\n \t\t\t﻿\n\t\t\tpreviously on the good doctor\n\t\t\t hey stop\n\t\t\t danny\n\t\t\t\n\t\t\tjust promise me\n\t\t\t...


In [20]:
# Apply a second round of cleaning
def clean_text_round2(text):
    '''Get rid of some additional punctuation, nonsensical text, and excess whitespace that was missed the first time around.'''
    text = re.sub(r'[‘’“”…]', '', text)  # Remove curly quotes and ellipses
    text = re.sub(r'\n', '', text)  # Remove newline characters
    text = re.sub(r'\s+', ' ', text)  # Collapse runs of spaces, tabs, etc. into a single space
    text = text.strip()  # Remove leading and trailing whitespace
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Drop anything that is not a letter or whitespace
    return text

round2 = lambda x: clean_text_round2(x)
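
Round 2 deletes newlines before collapsing whitespace. That works for these transcripts because every newline is followed by tabs, but on text where a bare newline is the only separator it would join two words. A small sketch of the difference follows; the function names and the `bare_newline` input are hypothetical.

```python
import re

def drop_newlines(text):
    """Mirrors round 2: delete newlines, then collapse the remaining whitespace."""
    text = re.sub(r"\n", "", text)
    return re.sub(r"\s+", " ", text).strip()

def collapse_whitespace(text):
    """Variant: treat every whitespace run, newlines included, as one space."""
    return re.sub(r"\s+", " ", text).strip()

with_tabs = "hey stop\n\t\t\tdanny"   # shape of the scraped transcripts
bare_newline = "hey stop\ndanny"      # hypothetical input without indentation

print(drop_newlines(with_tabs))           # hey stop danny
print(drop_newlines(bare_newline))        # hey stopdanny
print(collapse_whitespace(bare_newline))  # hey stop danny
```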

In [21]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(data_clean.transcript.apply(round2))
data_clean

Unnamed: 0,transcript
Avatar: The Last Airbender,there he is cut him off stop in the name of right there go go hes over there get him there he is dont let him get away after him stop an earthbend...
Curb Your Enthusiasm,thats a long flight man long flight yeah brutal hey larry i see you keep fighting for us okay huh how about that my people showing you fucking lov...
Death and Other Details,keith trubitsky aka danny turner previously on death and other details im rufus must be imogene imogene the worlds greatest detective im here to ...
Euphoria,heart beating fluid whooshing rue i was once happy content sloshing around in my own private primordial pool then one day for reasons beyond my co...
Game Of Thrones,a girl will return to the docks she will watch the gambler and then what a gift for the thin man why should i spend my time listening to you you h...
Last Week Tonight with John Oliver,welcome to last week tonight im john oliver thank you so much for joining us we are back and we have missed a lot in the last two months from ron...
Peaky Blinders,i just got served a black hand everybody will have got one theyre coming for us all these men will not leave our city until our whole family is de...
Resident Alien,previously on resident alien there are greyhuman hybrids all over the earth if anybodys up to something ill know it im going away max but youre s...
The Boys,the answer to all our prayers a virus that kills supes somebodys been running tests holy shit lets go sameer is the virus gone thats the only dos...
The Good Doctor,previously on the good doctor hey stop danny just promise me what no opioids he asked you not to do that hes tachycardic and hypertensive because...


## Organizing The Data


The cleaned data is organized into two standard text formats:

1. Corpus - a collection of text
2. Document-Term Matrix - word counts in matrix format

### Corpus

A corpus is a collection of texts, and they are all put together neatly in a pandas dataframe here.

In [22]:
# Let's take a look at our dataframe
data_df

Unnamed: 0,transcript
Avatar: The Last Airbender,\r\n\r\n\r\n \t\t\t[mysterious music playing]\n\t\t\t[bell clanging]\n\t\t\t[action music playing]\n\t\t\t[guard] There he is! ...
Curb Your Enthusiasm,"\r\n\r\n\r\n \t\t\t- That's a long flight, man.\n\t\t\t- Long flight. Yeah.\n\t\t\tBrutal.\n\t\t\tHey, Larry, I see you.\n\t\t\..."
Death and Other Details,"\r\n\r\n\r\n \t\t\t﻿1\n\t\t\tKEITH TRUBITSKY aka DANNY TURNER:\n\t\t\tPreviously, on Death and Other Details\n\t\t\t[WHISTLES] ..."
Euphoria,"\r\n\r\n\r\n \t\t\t1 - (HEART BEATING) - (FLUID WHOOSHING) RUE: I was once happy, content, sloshing around in my own private,..."
Game Of Thrones,\r\n\r\n\r\n \t\t\tA girl will return to the docks. She will watch the gambler. And then what? A gift for the thin man. Why sho...
Last Week Tonight with John Oliver,"\r\n\r\n\r\n \t\t\t﻿1\n\t\t\tWelcome to ""Last Week Tonight!""\n\t\t\tI'm John Oliver!\n\t\t\tThank you so much for joining us.\n..."
Peaky Blinders,\r\n\r\n\r\n \t\t\tI just got served a black hand. Everybody will have got one. They're coming for us all. These men will not...
Resident Alien,"\r\n\r\n\r\n \t\t\t﻿1\n\t\t\tPreviously on ""Resident Alien""\n\t\t\tThere are Grey-human\n\t\t\thybrids all over the Earth.\n\t\..."
The Boys,\r\n\r\n\r\n \t\t\t﻿1\n\t\t\t[BUTCHER] The answer to all our prayers.\n\t\t\tA virus that kills Supes.\n\t\t\t[FRENCHIE] Somebo...
The Good Doctor,"\r\n\r\n\r\n \t\t\t﻿1\n\t\t\tPreviously on ""The Good Doctor""\n\t\t\t- Hey! Stop!\n\t\t\t- Danny!\n\t\t\t[TIRES SCREECH]\n\t\t\t..."


In [23]:
# Let's add the TV Shows' full names as well (same order as the sorted index)
full_names = [
    'AVATAR: THE LAST AIRBENDER – S01E01 – AANG',
    'CURB YOUR ENTHUSIASM – S12E03 – VERTICAL DROP, HORIZONTAL TUG',
    'DEATH AND OTHER DETAILS – S01E07 – MEMORABLE',
    'EUPHORIA-PILOT',
    'GAME-OF-THRONES-BATTLE-OF-BASTARDS',
    'SUPREME COURT CORRUPTION: LAST WEEK TONIGHT WITH JOHN OLIVER',
    'PEAKY-BLINDERS-S4-E6-THE-COMPANY',
    'RESIDENT-ALIEN-S3-E2-THE-UPPER-HAND',
    'THE-BOYS-DIRTY-BUSINESS',
    'THE GOOD DOCTOR – S07E01 – BABY, BABY, BABY',
]

data_df['full_name'] = full_names
data_df

Unnamed: 0,transcript,full_name
Avatar: The Last Airbender,\r\n\r\n\r\n \t\t\t[mysterious music playing]\n\t\t\t[bell clanging]\n\t\t\t[action music playing]\n\t\t\t[guard] There he is! ...,AVATAR: THE LAST AIRBENDER – S01E01 – AANG
Curb Your Enthusiasm,"\r\n\r\n\r\n \t\t\t- That's a long flight, man.\n\t\t\t- Long flight. Yeah.\n\t\t\tBrutal.\n\t\t\tHey, Larry, I see you.\n\t\t\...","CURB YOUR ENTHUSIASM – S12E03 – VERTICAL DROP, HORIZONTAL TUG"
Death and Other Details,"\r\n\r\n\r\n \t\t\t﻿1\n\t\t\tKEITH TRUBITSKY aka DANNY TURNER:\n\t\t\tPreviously, on Death and Other Details\n\t\t\t[WHISTLES] ...",DEATH AND OTHER DETAILS – S01E07 – MEMORABLE
Euphoria,"\r\n\r\n\r\n \t\t\t1 - (HEART BEATING) - (FLUID WHOOSHING) RUE: I was once happy, content, sloshing around in my own private,...",EUPHORIA-PILOT
Game Of Thrones,\r\n\r\n\r\n \t\t\tA girl will return to the docks. She will watch the gambler. And then what? A gift for the thin man. Why sho...,GAME-OF-THRONES-BATTLE-OF-BASTARDS
Last Week Tonight with John Oliver,"\r\n\r\n\r\n \t\t\t﻿1\n\t\t\tWelcome to ""Last Week Tonight!""\n\t\t\tI'm John Oliver!\n\t\t\tThank you so much for joining us.\n...",SUPREME COURT CORRUPTION: LAST WEEK TONIGHT WITH JOHN OLIVER
Peaky Blinders,\r\n\r\n\r\n \t\t\tI just got served a black hand. Everybody will have got one. They're coming for us all. These men will not...,PEAKY-BLINDERS-S4-E6-THE-COMPANY
Resident Alien,"\r\n\r\n\r\n \t\t\t﻿1\n\t\t\tPreviously on ""Resident Alien""\n\t\t\tThere are Grey-human\n\t\t\thybrids all over the Earth.\n\t\...",RESIDENT-ALIEN-S3-E2-THE-UPPER-HAND
The Boys,\r\n\r\n\r\n \t\t\t﻿1\n\t\t\t[BUTCHER] The answer to all our prayers.\n\t\t\tA virus that kills Supes.\n\t\t\t[FRENCHIE] Somebo...,THE-BOYS-DIRTY-BUSINESS
The Good Doctor,"\r\n\r\n\r\n \t\t\t﻿1\n\t\t\tPreviously on ""The Good Doctor""\n\t\t\t- Hey! Stop!\n\t\t\t- Danny!\n\t\t\t[TIRES SCREECH]\n\t\t\t...","THE GOOD DOCTOR – S07E01 – BABY, BABY, BABY"


In [24]:
# Let's pickle it for later use
data_df.to_pickle("corpus.pkl")

### Document-Term Matrix

For many of the techniques we'll be using in future assignments, the text must be tokenized, meaning broken down into smaller pieces. The most common tokenization technique is to break text down into words. We can do this using scikit-learn's `CountVectorizer`, where every row represents a different document and every column represents a different word.

In addition, `CountVectorizer` can remove stop words: common words such as 'a' and 'the' that add little meaning to the text.
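
To make the row-per-document, column-per-word layout concrete, here is a hand-rolled document-term matrix built with `collections.Counter`. It is a simplified stand-in for what `CountVectorizer` produces, not a reimplementation: the three toy documents and the one-word stop list are assumptions, and `CountVectorizer` also lowercases text and, with its default settings, ignores single-character tokens.

```python
from collections import Counter

# Toy corpus (made up for illustration)
docs = {
    "doc_a": "the dragon burns the city",
    "doc_b": "the doctor saves the patient",
    "doc_c": "dragon dragon everywhere",
}
STOP_WORDS = {"the"}  # toy list; scikit-learn's built-in English list is far longer

# Count the non-stop-word tokens in each document
counts = {
    name: Counter(w for w in text.split() if w not in STOP_WORDS)
    for name, text in docs.items()
}

# Columns: the sorted vocabulary across all documents
vocabulary = sorted(set().union(*counts.values()))

# Rows: one list of counts per document, in column order
matrix = [[counts[name][term] for term in vocabulary] for name in docs]

print(vocabulary)
for name, row in zip(docs, matrix):
    print(name, row)
# ['burns', 'city', 'doctor', 'dragon', 'everywhere', 'patient', 'saves']
# doc_a [1, 1, 0, 1, 0, 0, 0]
# doc_b [0, 0, 1, 0, 0, 1, 1]
# doc_c [0, 0, 0, 2, 1, 0, 0]
```

Most entries are zero, as in the real matrix, because each document uses only a small part of the shared vocabulary.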

In [25]:
# We are going to create a document-term matrix using CountVectorizer, and exclude common English stop words
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')
data_cv = cv.fit_transform(data_clean.transcript)
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())
data_dtm.index = data_clean.index
data_dtm

Unnamed: 0,aa,aah,aang,aaron,abandon,abandoning,abbott,abduct,abducted,abduction,...,youreyoure,youve,zaps,zebra,zendaya,zero,zoe,zombies,zone,zuko
Avatar: The Last Airbender,0,0,23,0,0,0,0,0,0,0,...,0,4,0,0,0,0,0,0,0,2
Curb Your Enthusiasm,1,0,0,0,0,0,0,0,0,0,...,0,4,0,0,0,0,0,0,0,0
Death and Other Details,0,0,0,0,0,0,0,0,0,0,...,0,2,0,0,0,0,0,0,1,0
Euphoria,0,1,0,0,0,0,0,0,0,0,...,0,4,0,0,0,1,0,1,0,0
Game Of Thrones,0,0,0,0,1,1,0,1,0,0,...,0,2,0,0,0,0,0,0,0,0
Last Week Tonight with John Oliver,0,0,0,0,0,0,0,0,0,0,...,0,4,0,0,0,1,0,0,0,0
Peaky Blinders,0,0,0,0,0,0,0,0,0,0,...,0,2,0,0,0,0,0,0,0,0
Resident Alien,0,0,0,0,0,0,0,0,1,1,...,0,1,1,1,0,0,0,0,0,0
The Boys,0,0,0,0,0,0,0,0,0,0,...,1,6,0,0,2,0,2,0,0,0
The Good Doctor,0,0,0,2,0,0,1,0,0,0,...,0,3,0,0,0,1,0,0,0,0


In [26]:
# Let's pickle it for later use
data_dtm.to_pickle("dtm.pkl")

In [27]:
# Let's also pickle the cleaned data (before we put it in document-term matrix format) and the CountVectorizer object
data_clean.to_pickle('data_clean.pkl')
with open("cv.pkl", "wb") as file:  # the context manager closes the file after writing
    pickle.dump(cv, file)