# Data Cleaning

` Data cleaning is a time-consuming and unenjoyable task, yet it's a very important one. Keep in mind: "garbage in, garbage out".`

#### Feeding dirty data into a model will give us results that are meaningless.

### Objective:

1. Getting the data 
2. Cleaning the data 
3. Organizing the data - organize the cleaned data in a way that is easy to input into other algorithms

### Output:
#### Cleaned and organized data in two standard text formats:

1. Corpus - a collection of text
2. Document-Term Matrix - word counts in matrix format

## Problem Statement

Look at transcripts of various comedians, note their similarities and differences, and find out whether the stand-up comedian of your choice has a comedy style different from that of the other comedians.


## Getting The Data

You can get the transcripts of some comedians from [Scraps From The Loft](http://scrapsfromtheloft.com). 

You can use IMDB ratings to narrow the selection to the 10 or 20 highest-rated comedians.






### For example:

In [1]:
# Web scraping, pickle imports
import requests
from bs4 import BeautifulSoup
import pickle
import nltk
from nltk.corpus import stopwords

# Scrapes transcript data from scrapsfromtheloft.com
def url_to_transcript(url):
    '''Returns transcript data specifically from scrapsfromtheloft.com.'''
    r = requests.get(url)
    htmlContent = r.content
    soup = BeautifulSoup(htmlContent, 'html.parser')
    text = [p.text for p in soup.find_all('p', {'style': 'text-align: justify;'})]
    print(url)
    return text

# URLs of transcripts in scope
urls = ['https://scrapsfromtheloft.com/comedy/moses-storm-trash-white-transcript/',
        'https://scrapsfromtheloft.com/comedy/chris-rock-bigger-blacker-1999-full-transcript/',
        'https://scrapsfromtheloft.com/comedy/tom-segura-disgraceful-2018-full-transcript/',
        'https://scrapsfromtheloft.com/comedy/trevor-noah-white-house-correspondents-dinner-2022-transcript/',
        'https://scrapsfromtheloft.com/comedy/gabriel-iglesias-stadium-fluffy-transcript/',
        'https://scrapsfromtheloft.com/comedy/iliza-shlesinger-hot-forever-transcript/',
        'https://scrapsfromtheloft.com/comedy/fortune-feimster-good-fortune-transcript/',
        'https://scrapsfromtheloft.com/comedy/deon-cole-charleens-boy-transcript/',
        'https://scrapsfromtheloft.com/comedy/patton-oswalt-we-all-scream-transcript/',
        'http://scrapsfromtheloft.com/2017/05/06/louis-ck-oh-my-god-full-transcript/',
        'http://scrapsfromtheloft.com/2017/04/11/dave-chappelle-age-spin-2017-full-transcript/',
        'http://scrapsfromtheloft.com/2018/03/15/ricky-gervais-humanity-transcript/',
        'http://scrapsfromtheloft.com/2017/05/24/bill-burr-im-sorry-feel-way-2014-full-transcript/',
        'http://scrapsfromtheloft.com/2017/04/21/jim-jefferies-bare-2014-full-transcript/',
        'http://scrapsfromtheloft.com/2017/08/02/john-mulaney-comeback-kid-2015-full-transcript/',
        'http://scrapsfromtheloft.com/2017/09/19/ali-wong-baby-cobra-2016-full-transcript/',
        'http://scrapsfromtheloft.com/2017/08/03/anthony-jeselnik-thoughts-prayers-2015-full-transcript/',
        'http://scrapsfromtheloft.com/2018/03/03/mike-birbiglia-my-girlfriends-boyfriend-2013-full-transcript/',
        'http://scrapsfromtheloft.com/2017/08/19/joe-rogan-triggered-2016-full-transcript/']

# Comedian names
comedians = ['moses', 'chris', 'tom', 'trevor', 'gabriel', 'iliza', 'fortune', 'deon', 'patton', 'louis', 'dave', 'ricky', 'bill', 'jim', 'john', 'ali', 'anthony', 'mike', 'joe']

In [2]:
# # Actually request transcripts (takes a few minutes to run)
transcripts = [url_to_transcript(u) for u in urls]

https://scrapsfromtheloft.com/comedy/moses-storm-trash-white-transcript/
https://scrapsfromtheloft.com/comedy/chris-rock-bigger-blacker-1999-full-transcript/
https://scrapsfromtheloft.com/comedy/tom-segura-disgraceful-2018-full-transcript/
https://scrapsfromtheloft.com/comedy/trevor-noah-white-house-correspondents-dinner-2022-transcript/
https://scrapsfromtheloft.com/comedy/gabriel-iglesias-stadium-fluffy-transcript/
https://scrapsfromtheloft.com/comedy/iliza-shlesinger-hot-forever-transcript/
https://scrapsfromtheloft.com/comedy/fortune-feimster-good-fortune-transcript/
https://scrapsfromtheloft.com/comedy/deon-cole-charleens-boy-transcript/
https://scrapsfromtheloft.com/comedy/patton-oswalt-we-all-scream-transcript/
http://scrapsfromtheloft.com/2017/05/06/louis-ck-oh-my-god-full-transcript/
http://scrapsfromtheloft.com/2017/04/11/dave-chappelle-age-spin-2017-full-transcript/
http://scrapsfromtheloft.com/2018/03/15/ricky-gervais-humanity-transcript/
http://scrapsfromtheloft.com/2017/0

In [3]:
# # Pickle files for later use

# Make a new directory to hold the text files
!mkdir transcripts

for i, c in enumerate(comedians):
    with open("transcripts/" + c + ".txt", "wb") as file:
        pickle.dump(transcripts[i], file)

In [4]:
# Load pickled files
data = {}
for i, c in enumerate(comedians):
    with open("transcripts/" + c + ".txt", "rb") as file:
        data[c] = pickle.load(file)
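
The save-and-load steps above can also be sketched without the `!mkdir` shell command, which raises an error if the notebook is re-run and the directory already exists. The snippet below is only an illustration under assumptions: the helper names `save_transcripts` and `load_transcripts`, the `.pkl` extension, and the tiny sample data are made up for this sketch and are not part of the original notebook.

```python
import os
import pickle
import tempfile

def save_transcripts(transcripts_by_name, directory):
    """Pickle each transcript (a list of strings) to <directory>/<name>.pkl."""
    os.makedirs(directory, exist_ok=True)  # no error if the directory already exists
    for name, paragraphs in transcripts_by_name.items():
        with open(os.path.join(directory, name + ".pkl"), "wb") as f:
            pickle.dump(paragraphs, f)

def load_transcripts(names, directory):
    """Load the pickled transcripts back into a {name: list_of_strings} dict."""
    loaded = {}
    for name in names:
        with open(os.path.join(directory, name + ".pkl"), "rb") as f:
            loaded[name] = pickle.load(f)
    return loaded

# Tiny made-up sample so the sketch runs without scraping anything
sample = {"ali": ["Hello!", "Thank you for coming."], "bill": ["All right, thank you!"]}

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "transcripts")
    save_transcripts(sample, target)
    save_transcripts(sample, target)  # running it twice is safe
    restored = load_transcripts(sample.keys(), target)

print(restored == sample)
```

Building the dictionary directly from names, rather than indexing two parallel lists, also avoids a silent mismatch if the `urls` and `comedians` lists ever fall out of order.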

In [5]:
# Double check to make sure data has been loaded properly
data.keys()

dict_keys(['moses', 'chris', 'tom', 'trevor', 'gabriel', 'iliza', 'fortune', 'deon', 'patton', 'louis', 'dave', 'ricky', 'bill', 'jim', 'john', 'ali', 'anthony', 'mike', 'joe'])

In [6]:
# More checks
data['gabriel'][:2]

['[man] Can you please state your name? Martin Moreno. But you might know me as… Martinnnnn! I’ve been touring with Gabriel Iglesias for 20-plus years. Martinnnnn! And, yeah, he’s been screaming my name for 20-plus years. Hurry up, Martinnnnn! Dude, that’s better than most marriages. That’s a win. It has been an incredible journey. We’ve gone from garages, clubs, living rooms, to theaters, arenas around the world, and now a stadium. They say comedy is subjective, but when you’re selling out stadiums, that’s no longer subjective. So, what can people expect? One of the biggest shows Los Angeles has ever seen. And whatever you do, make sure you stick around till the end. You want to see how this thing ends. So, what else is left to say? Without further ado, live from Dodger Stadium, Gabriel Iglesias!',
 '[theme playing from 2001: A Space Odyssey]']

## Cleaning The Data

When dealing with numerical data, data cleaning often involves removing null values and duplicate data, dealing with outliers, etc. With text data, there are some common data cleaning techniques, which are also known as text pre-processing techniques.

With text data, this cleaning process can go on forever. There's always an exception to every cleaning step. So, we're going to follow the MVP (minimum viable product) approach - start simple and iterate.
### Assignment:
1. Perform the following data cleaning on transcripts:
i) Make text all lower case
ii) Remove punctuation
iii) Remove numerical values
iv) Remove common nonsensical text (\n)
v) Tokenize text
vi) Remove stop words
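
Before working on the full transcripts, it may help to see all six steps applied to a single sentence. This is a minimal sketch, not the notebook's actual cleaning functions: the function name `clean_and_tokenize`, the sample sentence, and the deliberately tiny `STOP_WORDS` set are assumptions made for illustration (a real run would use a fuller list, such as the one built into `CountVectorizer`).

```python
import re
import string

# Hand-picked, deliberately tiny stop word list (an assumption for this sketch)
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "it", "i", "for"}

def clean_and_tokenize(text):
    """Apply the six cleaning steps in order and return a list of tokens."""
    text = text.lower()                                               # i) lower case
    text = text.translate(str.maketrans("", "", string.punctuation))  # ii) remove punctuation
    text = re.sub(r"\w*\d\w*", "", text)                              # iii) remove words containing digits
    text = text.replace("\n", " ")                                    # iv) remove newline characters
    tokens = text.split()                                             # v) tokenize on whitespace
    return [t for t in tokens if t not in STOP_WORDS]                 # vi) remove stop words

sample = "I've been touring for 20-plus years,\nand it is a WIN!"
tokens = clean_and_tokenize(sample)
print(tokens)  # ['ive', 'been', 'touring', 'years', 'win']
```

Note that the order matters: because punctuation is removed first, "20-plus" becomes "20plus" and is then dropped entirely by the digit rule.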

In [7]:
# Let's take a look at our data again
next(iter(data.keys()))

'moses'

In [8]:
# Notice that our dictionary is currently in key: comedian, value: list of text format
next(iter(data.values()))

['♪♪♪',
 '[cheers and applause]',
 '♪♪♪',
 '[cheers and applause continue]',
 'Crazy will always beat scary. Do you know what I mean by that? It’s not a great thesis. It’s not profound, but legitimately, that is the closest I have come to forgiveness in my life. So for most of my life, my mom was a single parent. Five kids. No child support. We were on food stamps. When those ran out, we would dumpster dive for food. A lot of people find it hard to believe that I was ever that poor, ’cause look at this shit. [laughter] Like, not only do I look like, “Meh, everything just, like, worked out.” I look like the kind of, like, white, wealthy– I look like I was conceived in an Ivy League a capella concert. [laughter]',
 '[cheers and applause]',
 'You know what I mean? Where it is, like, that…',
 '♪ Shimmy-doo-wop, my dad owns every university ♪\n♪ Shimmy-doo-wop, what is adversity? ♪',
 'It’s not just rich either, right? It is, like, evil rich, right? It’s like a “Game of Thrones” King Joffre

In [9]:
# We are going to change this to key: comedian, value: string format
def combine_text(list_of_text):
    '''Takes a list of text and combines them into one large chunk of text.'''
    combined_text = ' '.join(list_of_text)
    return combined_text

In [10]:
# Combine it!
data_combined = {key: [combine_text(value)] for (key, value) in data.items()}

In [11]:
# We can either keep it in dictionary format or put it into a pandas dataframe
import pandas as pd
pd.set_option('max_colwidth',150)

data_df = pd.DataFrame.from_dict(data_combined).transpose()
data_df.columns = ['transcript']
data_df = data_df.sort_index()
data_df

Unnamed: 0,transcript
ali,"Ladies and gentlemen, please welcome to the stage: Ali Wong! Hi. Hello! Welcome! Thank you! Thank you for coming. Hello! Hello. We are gonna have ..."
anthony,"Thank you. Thank you. Thank you, San Francisco. Thank you so much. So good to be here. People were surprised when I told ’em I was gonna tape my s..."
bill,"[cheers and applause] All right, thank you! Thank you very much! Thank you. Thank you. Thank you. How are you? What’s going on? Thank you. It’s a ..."
chris,"Ladies and gentlemen… live from the world-famous Apollo Theater… in Harlem, New York. Are you ready? Please welcome Mr. Chris Rock! What’s up… New..."
dave,"This is Dave. He tells dirty jokes for a living. That stare is where most of his hard work happens. It signifies a profound train of thought, the ..."
deon,"[indistinct chattering] [woman] Oh, this water is so good. I don’t know why I was so thirsty. But anyway, I feel comfortable now. It feels real go..."
fortune,[upbeat music plays] [audience cheering] [announcer] Please welcome Fortune Feimster! ♪ I’m a powerful woman ♪ ♪ Always get what I want ♪ ♪ So don...
gabriel,[man] Can you please state your name? Martin Moreno. But you might know me as… Martinnnnn! I’ve been touring with Gabriel Iglesias for 20-plus yea...
iliza,"[upbeat music playing] [crowd cheering] Cleveland, Ohio! Thank you! Thank you so much. This is so great. This is so nice to be here with you in pu..."
jim,"[Car horn honks] [Audience cheering] [Announcer] Ladies and gentlemen, please welcome to the stage Mr. Jim Jefferies! [Upbeat music playing] Hello..."


In [12]:
# Let's take a look at the transcript for Gabriel Iglesias
data_df.transcript.loc['gabriel']

"[man] Can you please state your name? Martin Moreno. But you might know me as… Martinnnnn! I’ve been touring with Gabriel Iglesias for 20-plus years. Martinnnnn! And, yeah, he’s been screaming my name for 20-plus years. Hurry up, Martinnnnn! Dude, that’s better than most marriages. That’s a win. It has been an incredible journey. We’ve gone from garages, clubs, living rooms, to theaters, arenas around the world, and now a stadium. They say comedy is subjective, but when you’re selling out stadiums, that’s no longer subjective. So, what can people expect? One of the biggest shows Los Angeles has ever seen. And whatever you do, make sure you stick around till the end. You want to see how this thing ends. So, what else is left to say? Without further ado, live from Dodger Stadium, Gabriel Iglesias! [theme playing from 2001: A Space Odyssey] [ignition cranking] [engine not starting] [ignition cranking] [engine starts] ♪ California love ♪\n♪ California ♪\n♪ Knows how to party ♪\n♪ Californ

In [13]:
# Apply a first round of text cleaning techniques
import re
import string

def clean_text_round1(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation and remove words containing numbers.'''
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)  # raw strings avoid invalid-escape warnings
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

round1 = lambda x: clean_text_round1(x)

In [14]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(data_df.transcript.apply(round1))
data_clean
##
data_clean.transcript.loc['gabriel']

' can you please state your name martin moreno but you might know me as… martinnnnn i’ve been touring with gabriel iglesias for  years martinnnnn and yeah he’s been screaming my name for  years hurry up martinnnnn dude that’s better than most marriages that’s a win it has been an incredible journey we’ve gone from garages clubs living rooms to theaters arenas around the world and now a stadium they say comedy is subjective but when you’re selling out stadiums that’s no longer subjective so what can people expect one of the biggest shows los angeles has ever seen and whatever you do make sure you stick around till the end you want to see how this thing ends so what else is left to say without further ado live from dodger stadium gabriel iglesias      ♪ california love ♪\n♪ california ♪\n♪ knows how to party ♪\n♪ california ♪\n♪ knows how to party ♪\n♪ in the city of la ♪\n♪ in the city of good ol’ watts ♪\n♪ in the city the city of compton ♪\n♪ we keep it rockin’ ♪\n♪ we keep it rockin’

In [15]:
# Apply a second round of cleaning
# import nltk

# def get_stem(text):
#     stemmer = nltk.porter.PorterStemmer()
#     text = ' '.join([stemmer.stem(word) for word in text.split()])
#     return text

def clean_text_round2(text):
    '''Get rid of some additional punctuation and non-sensical text that was missed the first time around.'''
    text = re.sub('[‘’“”…]', '', text)
    text = re.sub('\n', '', text)
    
#     text = re.sub(r'\b(hi|hello|welcome|hey|thank|you|bye|cheers)\b', ' ', text) #remove words used for greeting
    text = re.sub(r'\(.*?\)', '', text) #remove words within parentheses
    text = re.sub('♪', '', text) #remove music notes
    text = re.sub(r'[^a-zA-Z\s]', '', text) #remove all non-alphabetic characters except whitespace
    text = re.sub(r'\b\w{1,2}\b', ' ', text) #remove all one- and two-letter words like "um", "uh", etc., which carry little meaning
#     text = re.sub(r'\b(a+h+|(h+a+)+)\b', '', text) #remove screaming or laughing noises
    text = re.sub(r"\s+", " ", text) #remove multiple spaces
#     text = get_stem(text) #stemming (reducing derived words to their root form)
    return text

round2 = lambda x: clean_text_round2(x)

In [16]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(data_clean.transcript.apply(round2))
data_clean
data_clean.transcript.loc['fortune']

' please welcome fortune feimster powerful woman always get what want dont you get way now thats not what want cause powerful woman always get what want dont you get way now thats not what want cause powerful woman always get what want dont you get way now thats not what want cause powerful woman always get what powerful woman yeah man stop stop chicago whats going man thank you for being here the beautiful chicago shakespeare theater lot has transpired the last couple years right the world has dealt with some crazy stuff felt like the end times and thought lesbians would built for that you know put bunker with some canned hams were good but things went south and learned lot about myself found out have zero survival skills none all had was stay home got nothing accomplished sourdough was started house didnt learn how make cold brew nothing was partner jax who surprised she was the one outside painting our fence rewired our electricity fixed our plumbing was the one the couch every nigh

## Organizing The Data

### Assignment:
1. Organize the data in two standard text formats:
   a) Corpus - a collection of texts, all put together neatly in a pandas dataframe here.
   b) Document-Term Matrix - word counts in matrix format

### Corpus: Example

A corpus is a collection of texts, and they are all put together neatly in a pandas dataframe here.

In [17]:
# Let's take a look at our dataframe
data_df

Unnamed: 0,transcript
ali,"Ladies and gentlemen, please welcome to the stage: Ali Wong! Hi. Hello! Welcome! Thank you! Thank you for coming. Hello! Hello. We are gonna have ..."
anthony,"Thank you. Thank you. Thank you, San Francisco. Thank you so much. So good to be here. People were surprised when I told ’em I was gonna tape my s..."
bill,"[cheers and applause] All right, thank you! Thank you very much! Thank you. Thank you. Thank you. How are you? What’s going on? Thank you. It’s a ..."
chris,"Ladies and gentlemen… live from the world-famous Apollo Theater… in Harlem, New York. Are you ready? Please welcome Mr. Chris Rock! What’s up… New..."
dave,"This is Dave. He tells dirty jokes for a living. That stare is where most of his hard work happens. It signifies a profound train of thought, the ..."
deon,"[indistinct chattering] [woman] Oh, this water is so good. I don’t know why I was so thirsty. But anyway, I feel comfortable now. It feels real go..."
fortune,[upbeat music plays] [audience cheering] [announcer] Please welcome Fortune Feimster! ♪ I’m a powerful woman ♪ ♪ Always get what I want ♪ ♪ So don...
gabriel,[man] Can you please state your name? Martin Moreno. But you might know me as… Martinnnnn! I’ve been touring with Gabriel Iglesias for 20-plus yea...
iliza,"[upbeat music playing] [crowd cheering] Cleveland, Ohio! Thank you! Thank you so much. This is so great. This is so nice to be here with you in pu..."
jim,"[Car horn honks] [Audience cheering] [Announcer] Ladies and gentlemen, please welcome to the stage Mr. Jim Jefferies! [Upbeat music playing] Hello..."


In [18]:
# Let's add the comedians' full names as well
full_names=['Ali Wong', 'Anthony Jeselnik', 'Bill Burr', 'Chris Rock', 'Dave Chappelle', 'Deon Cole', 'Fortune Feimster', 'Gabriel Iglesias', 'Iliza Shlesinger', 'Jim Jefferies', 'Joe Rogan', 'John Mulaney', 'Louis C.K.', 'Mike Birbiglia', 'Moses Storm', 'Patton Oswalt', 'Ricky Gervais', 'Tom Segura', 'Trevor Noah']
data_df['full_name'] = full_names
data_df

Unnamed: 0,transcript,full_name
ali,"Ladies and gentlemen, please welcome to the stage: Ali Wong! Hi. Hello! Welcome! Thank you! Thank you for coming. Hello! Hello. We are gonna have ...",Ali Wong
anthony,"Thank you. Thank you. Thank you, San Francisco. Thank you so much. So good to be here. People were surprised when I told ’em I was gonna tape my s...",Anthony Jeselnik
bill,"[cheers and applause] All right, thank you! Thank you very much! Thank you. Thank you. Thank you. How are you? What’s going on? Thank you. It’s a ...",Bill Burr
chris,"Ladies and gentlemen… live from the world-famous Apollo Theater… in Harlem, New York. Are you ready? Please welcome Mr. Chris Rock! What’s up… New...",Chris Rock
dave,"This is Dave. He tells dirty jokes for a living. That stare is where most of his hard work happens. It signifies a profound train of thought, the ...",Dave Chappelle
deon,"[indistinct chattering] [woman] Oh, this water is so good. I don’t know why I was so thirsty. But anyway, I feel comfortable now. It feels real go...",Deon Cole
fortune,[upbeat music plays] [audience cheering] [announcer] Please welcome Fortune Feimster! ♪ I’m a powerful woman ♪ ♪ Always get what I want ♪ ♪ So don...,Fortune Feimster
gabriel,[man] Can you please state your name? Martin Moreno. But you might know me as… Martinnnnn! I’ve been touring with Gabriel Iglesias for 20-plus yea...,Gabriel Iglesias
iliza,"[upbeat music playing] [crowd cheering] Cleveland, Ohio! Thank you! Thank you so much. This is so great. This is so nice to be here with you in pu...",Iliza Shlesinger
jim,"[Car horn honks] [Audience cheering] [Announcer] Ladies and gentlemen, please welcome to the stage Mr. Jim Jefferies! [Upbeat music playing] Hello...",Jim Jefferies


In [19]:
# Let's pickle it for later use
data_df.to_pickle("corpus.pkl")

### Document-Term Matrix: Example

For many of the techniques we'll be using in future assignments, the text must be tokenized, meaning broken down into smaller pieces. The most common tokenization technique is to break down text into words. We can do this using scikit-learn's ` CountVectorizer `, where every row will represent a different document and every column will represent a different word.

In addition, with ` CountVectorizer `, we can remove stop words. Stop words are common words that add no additional meaning to text such as 'a', 'the', etc.
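
To make the row/column layout concrete, here is a dependency-free sketch of what a document-term matrix contains. It is not how scikit-learn implements `CountVectorizer`; the two toy documents and the four-word `STOP_WORDS` set are assumptions chosen to keep the result small enough to check by hand.

```python
from collections import Counter

# Toy stop word list and documents (assumptions for this sketch)
STOP_WORDS = {"a", "the", "is", "and"}
docs = {
    "doc1": "the cat and the hat",
    "doc2": "a cat is a cat",
}

# Count the non-stop words in each document
counts = {
    name: Counter(word for word in text.split() if word not in STOP_WORDS)
    for name, text in docs.items()
}

# The columns are the sorted vocabulary across all documents
vocabulary = sorted(set().union(*counts.values()))

# One row per document, one count per vocabulary term
dtm = {name: [counts[name][term] for term in vocabulary] for name in docs}

print(vocabulary)  # ['cat', 'hat']
print(dtm)         # {'doc1': [1, 1], 'doc2': [2, 0]}
```

Each row is a document and each column a word, exactly as in the dataframe built below, only with 2 documents and 2 terms instead of 19 transcripts and thousands of terms.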

In [20]:
# We are going to create a document-term matrix using CountVectorizer, and exclude common English stop words
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')
data_cv = cv.fit_transform(data_clean.transcript)
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2
data_dtm.index = data_clean.index
data_dtm



Unnamed: 0,aaaaah,aaah,aah,aaras,abandon,abandoned,abbott,abby,abc,abcs,...,zip,zippers,zombie,zombies,zones,zoning,zoo,zoom,zoomed,zucker
ali,0,0,0,0,0,0,0,0,1,0,...,0,0,1,0,0,0,0,0,0,0
anthony,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
bill,1,0,0,0,0,0,0,0,0,1,...,0,0,1,1,0,1,0,0,0,0
chris,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
dave,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
deon,0,0,0,0,0,1,0,0,0,0,...,1,1,0,0,0,0,0,0,0,0
fortune,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,1,0
gabriel,0,1,0,1,0,0,0,0,1,0,...,0,0,0,0,1,0,0,4,0,0
iliza,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0
jim,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [21]:
# Let's pickle it for later use
data_dtm.to_pickle("dtm.pkl")

In [22]:
# Let's also pickle the cleaned data (before we put it in document-term matrix format) and the CountVectorizer object
data_clean.to_pickle('data_clean.pkl')
with open("cv.pkl", "wb") as file:  # the context manager closes the file after writing
    pickle.dump(cv, file)

## Additional Assignments:

1. Can you add an additional regular expression to the clean_text_round2 function to further clean the text?
2. Play around with CountVectorizer's parameters. What is ngram_range? What is min_df and max_df?
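
As one possible answer to the first question, the document-term matrix above contains columns such as `aaaaah`, `aaah` and `aah`, which are the same exclamation spelled with different lengths. A regular expression with a backreference can collapse such stretched spellings. This is only a sketch of one option: the helper name `squeeze_repeats` is made up here, and collapsing runs of three or more identical characters to a single character is an assumption that may not suit every word.

```python
import re

def squeeze_repeats(text):
    """Collapse any character repeated 3 or more times in a row down to one."""
    # (\w) captures one character; \1{2,} matches two or more further copies of it
    return re.sub(r"(\w)\1{2,}", r"\1", text)

result = squeeze_repeats("hurry up martinnnnn aaaaah aaah")
print(result)  # hurry up martin ah ah
```

Double letters such as the "rr" in "hurry" are left alone, because the pattern only fires on runs of three or more.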

In [23]:
"""
--> CountVectorizer is a tool that is used to convert a collection of text documents to a matrix of token counts.

Its ngram_range parameter is used to define the range of n-grams to extract from the text. It is a tuple, where the 
first element represents the minimum value of n and the second element represents the maximum value of n. 
For example, let's say we have the following sentence: "I like to play football"
If ngram_range=(1, 1), it will extract only unigrams (individual words) such as ["I", "like", "to", "play", "football"].
If ngram_range=(2, 2), it will extract only bigrams (two consecutive words) such as ["I like", "like to", "to play", "play football"].
If ngram_range=(1, 2), it will extract both unigrams and bigrams such as ["I", "like", "to", "play", "football", "I like", "like to", "to play", "play football"].
"""

"""
--> min_df parameter is used to ignore terms that have a document frequency strictly lower than the given threshold. 
It is used for removing terms that appear too infrequently. For example, let's say we have a corpus of 4 documents:
Document 1: "I like to play football"
Document 2: "I love to play football"
Document 3: "I hate to play football"
Document 4: "I enjoy to play basketball"
If we set min_df=2, the vocabulary will only include words that appear in at least 2 documents. 
In this case, the words "I", "to", "play" and "football" would be included in the vocabulary because they appear 
in at least 2 documents. The words "like", "love", "hate", "enjoy" and "basketball" would be ignored because each 
appears in only 1 document.

Similarly,
--> max_df parameter is used to ignore terms that have a document frequency strictly higher than the given threshold. 
It is used for removing terms that appear too frequently.
For example, if max_df=0.8, it would ignore terms that appear in more than 80% of the documents in the corpus. 
"""

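
The three parameters described above can be checked by hand on the four example sentences. The snippet below is a plain-Python sketch of the idea, not scikit-learn's implementation: the helpers `ngrams`, `document_frequency` and `filter_vocabulary` are hypothetical names, `min_df` is treated as an absolute count and `max_df` as a proportion, and the text is lowercased up front. One difference worth knowing: with its default settings, the real `CountVectorizer` also drops one-letter tokens such as "i", which this sketch keeps so that it matches the explanation.

```python
from collections import Counter

docs = [
    "i like to play football",
    "i love to play football",
    "i hate to play football",
    "i enjoy to play basketball",
]

def ngrams(tokens, n_min, n_max):
    """Return all n-grams with n_min <= n <= n_max, joined by spaces."""
    out = []
    for n in range(n_min, n_max + 1):
        out.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out

def document_frequency(docs, n_min=1, n_max=1):
    """Count in how many documents each n-gram appears (at most once per document)."""
    df = Counter()
    for doc in docs:
        df.update(set(ngrams(doc.split(), n_min, n_max)))
    return df

def filter_vocabulary(df, n_docs, min_df=1, max_df=1.0):
    """Keep terms with min_df <= document frequency <= max_df * n_docs."""
    return sorted(t for t, c in df.items() if min_df <= c <= max_df * n_docs)

bigrams = ngrams(docs[0].split(), 2, 2)
df = document_frequency(docs)
kept_min = filter_vocabulary(df, len(docs), min_df=2)
kept_both = filter_vocabulary(df, len(docs), min_df=2, max_df=0.8)

print(bigrams)    # ['i like', 'like to', 'to play', 'play football']
print(kept_min)   # ['football', 'i', 'play', 'to']
print(kept_both)  # ['football']
```

With `max_df=0.8` and 4 documents the cut-off is 3.2 documents, so "i", "to" and "play" (present in all 4) are removed while "football" (present in 3) survives.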