**What Are N-Grams(ngrams)?**

N-grams are continuous sequences of words or symbols, or tokens in a document. In technical terms, they can be defined as the neighboring sequences of items in a document. They come into play when we deal with text data in NLP (Natural Language Processing) tasks. They have a wide range of applications, like language models, semantic features, spelling correction, machine translation, text mining, etc.

Example of N-Grams
Let’s understand n-grams practically with the help of the following sample sentence:

“I reside in Bengaluru”.

|SL.No.|Type of n-gram|Generated n-grams|
|-------|----|------------|        
|1|Unigram|[“I”,”reside”,”in”,”Bengaluru”]|
|2|Bigram|	         [“I reside”,”reside in”,”in Bengaluru”]
|3|	      Trigram|	         [“I reside in”, “reside in Bengaluru”]


In [3]:
from nltk import ngrams
sentence = 'I reside in Bengaluru.'
n = 1
unigrams = ngrams(sentence.split(), n)
for grams in unigrams:
    print (grams)

('I',)
('reside',)
('in',)
('Bengaluru.',)


In [4]:
from nltk import ngrams
sentence = 'I reside in Bengaluru.'
n = 2
unigrams = ngrams(sentence.split(), n)
for grams in unigrams:
    print (grams)

('I', 'reside')
('reside', 'in')
('in', 'Bengaluru.')


In [5]:
from nltk import ngrams
sentence = 'I reside in Bengaluru.'
n = 3
unigrams = ngrams(sentence.split(), n)
for grams in unigrams:
    print (grams)

('I', 'reside', 'in')
('reside', 'in', 'Bengaluru.')


From the table above, it’s clear that unigram means taking only one word at a time, bigram means taking two words at a time, and trigram means taking three words at a time.

In [None]:
#!pip install youtube-transcript-api

In [2]:
import youtube_transcript_api
from youtube_transcript_api import YouTubeTranscriptApi
import nltk
import re
from nltk.corpus import stopwords
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer

In [3]:
link="https://www.youtube.com/watch?v=ufy2hHdfOEs"
unique_id = link.split("=")[-1]
sub = YouTubeTranscriptApi.get_transcript(unique_id)  
subtitle = " ".join([x['text'] for x in sub])

In [4]:
subtitle

"yes yeah it's great thank you so before going to start I will give you some quick points and uh first thing is my voice is clear that is very happy to hear and this session is regarding to machine learning operations that is ml Ops so this session is both Theory and practicals okay Theory and practical practically about end to end flow end to end flow of mlos mlos so that is the first thing it is a both includes Theory as well as practicals and this session is about two hours I'm expecting for two hours it might be good for 2.5 hours also because this is a very very important topic we need to connect with cloud and we need to Showcase that how it is works and I will take all the question and answers I will give some pass in Middle as well as at the end so all q and a session you can take at n most probably you can note your questions I will give each and every one answers okay and first of all I need one confirmation how many of you know about data science data science and machine lea

In [55]:
len(subtitle)

99689

In [35]:
from nltk.tokenize import sent_tokenize

subtitle = subtitle.replace("n","")
sentences = sent_tokenize(subtitle)

In [36]:
sentences

["yes yeah it's great thak you so before goig to start I will give you some quick poits ad uh first thig is my voice is clear that is very happy to hear ad this sessio is regardig to machie learig operatios that is ml Ops so this sessio is both Theory ad practicals okay Theory ad practical practically about ed to ed flow ed to ed flow of mlos mlos so that is the first thig it is a both icludes Theory as well as practicals ad this sessio is about two hours I'm expectig for two hours it might be good for 2.5 hours also because this is a very very importat topic we eed to coect with cloud ad we eed to Showcase that how it is works ad I will take all the questio ad aswers I will give some pass i Middle as well as at the ed so all q ad a sessio you ca take at  most probably you ca ote your questios I will give each ad every oe aswers okay ad first of all I eed oe cofirmatio how may of you kow about data sciece data sciece ad machie learig three parts oe preset four five yeah it's a good umb

In [37]:
organized_sent = {k:v for v,k in enumerate(sentences)}
organized_sent

{"yes yeah it's great thak you so before goig to start I will give you some quick poits ad uh first thig is my voice is clear that is very happy to hear ad this sessio is regardig to machie learig operatios that is ml Ops so this sessio is both Theory ad practicals okay Theory ad practical practically about ed to ed flow ed to ed flow of mlos mlos so that is the first thig it is a both icludes Theory as well as practicals ad this sessio is about two hours I'm expectig for two hours it might be good for 2.5 hours also because this is a very very importat topic we eed to coect with cloud ad we eed to Showcase that how it is works ad I will take all the questio ad aswers I will give some pass i Middle as well as at the ed so all q ad a sessio you ca take at  most probably you ca ote your questios I will give each ad every oe aswers okay ad first of all I eed oe cofirmatio how may of you kow about data sciece data sciece ad machie learig three parts oe preset four five yeah it's a good umb

In [38]:
tf_idf = TfidfVectorizer(min_df=2, 
                                    strip_accents='unicode',
                                    max_features=None,
                                    lowercase = True,
                                    token_pattern=r'w{1,}',
                                    ngram_range=(1, 3), 
                                    use_idf=1,
                                    smooth_idf=1,
                                    sublinear_tf=1,
                                    stop_words = 'english')

In [40]:
import numpy as np
sentence_vectors = tf_idf.fit_transform(sentences)
sent_scores = np.array(sentence_vectors.sum(axis=1)).ravel()

In [42]:
sentence_vectors

<3x3 sparse matrix of type '<class 'numpy.float64'>'
	with 9 stored elements in Compressed Sparse Row format>

In [43]:
sentence_vectors.toarray()

array([[0.57740768, 0.57735028, 0.57729284],
       [0.5774808 , 0.57735033, 0.57721964],
       [0.57756455, 0.57735043, 0.57713575]])

In [41]:
sent_scores

array([1.7320508 , 1.73205078, 1.73205073])

In [48]:
len(sentences)

3

In [49]:
sentences

["yes yeah it's great thak you so before goig to start I will give you some quick poits ad uh first thig is my voice is clear that is very happy to hear ad this sessio is regardig to machie learig operatios that is ml Ops so this sessio is both Theory ad practicals okay Theory ad practical practically about ed to ed flow ed to ed flow of mlos mlos so that is the first thig it is a both icludes Theory as well as practicals ad this sessio is about two hours I'm expectig for two hours it might be good for 2.5 hours also because this is a very very importat topic we eed to coect with cloud ad we eed to Showcase that how it is works ad I will take all the questio ad aswers I will give some pass i Middle as well as at the ed so all q ad a sessio you ca take at  most probably you ca ote your questios I will give each ad every oe aswers okay ad first of all I eed oe cofirmatio how may of you kow about data sciece data sciece ad machie learig three parts oe preset four five yeah it's a good umb

In [47]:
#Now let’s find out the top N sentences that have a larger score.

N = 3
top_n_sentences = [sentences[index] for index in np.argsort(sent_scores, axis=0)[::-1][:N]]
top_n_sentences[2]

"I'm just clickig this whe I clickig this what they are sayig traffic was successfully tuel to the grok aget but aget failed to establish ow we created a tuel successfully but this aget is ot givig permissio to create a UI why this is Aget is ot give permissio because you should ot provide a autheticated ID the aget does't kow the aget aget does't kow does't kow your ID your ID so that's why he is ot givig permissio he is ot givig permissio simple guys first what we are goig we are goig to a compay called grok to make a tuel from which place to which place Google collab Notebook 2 Google collab Notebook 2 your ml flow UI ml flow UI so we created a tuel but the tuel is ot givig permissio because we does't provide our ID ow what we will do we will provide our ID ow so get help with this error ow it will reach to here you just click o the logi just click o the logi ow whe you log i it will ask your email ID ad it will create with homecare.alagoi real gmail.com okay it wheever you log i it

In [50]:
# mapping the scored sentences with their indexes as in the subtitle
mapped_sentences = [(sentence,organized_sent[sentence]) for sentence in top_n_sentences]
# Ordering the top-n sentences in their original order
mapped_sentences = sorted(mapped_sentences, key = lambda x: x[1])
ordered_sentences = [element[0] for element in mapped_sentences]
# joining the ordered sentence
summary = " ".join(ordered_sentences)

In [51]:
summary

"yes yeah it's great thak you so before goig to start I will give you some quick poits ad uh first thig is my voice is clear that is very happy to hear ad this sessio is regardig to machie learig operatios that is ml Ops so this sessio is both Theory ad practicals okay Theory ad practical practically about ed to ed flow ed to ed flow of mlos mlos so that is the first thig it is a both icludes Theory as well as practicals ad this sessio is about two hours I'm expectig for two hours it might be good for 2.5 hours also because this is a very very importat topic we eed to coect with cloud ad we eed to Showcase that how it is works ad I will take all the questio ad aswers I will give some pass i Middle as well as at the ed so all q ad a sessio you ca take at  most probably you ca ote your questios I will give each ad every oe aswers okay ad first of all I eed oe cofirmatio how may of you kow about data sciece data sciece ad machie learig three parts oe preset four five yeah it's a good umbe

In [56]:
len(summary)

94133

In [5]:
import os
os.getcwd()

'C:\\Users\\omkar\\OneDrive\\Documents\\Data science\\Naresh IT\\Data science\\Batch-2_July3\\NLP'