# Text Cleaning and Pre-processing

### Tokenization


In [1]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [2]:
data = "This is the first statement. This is 2nd. And this is 3rd. !"
tokens = nltk.sent_tokenize(data)
print(tokens)

['This is the first statement.', 'This is 2nd.', 'And this is 3rd.', '!']


In [3]:
data = "This is the 1st statement"
tokens = nltk.word_tokenize(data)
print(tokens)

['This', 'is', 'the', '1st', 'statement']


### Stemming

In [4]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()
words = ['Jumps', 'Jumping', 'jumped', 'jump']
for word in words:
    print(word, ":", ps.stem(word))
print(ps.stem('studies'))

Jumps : jump
Jumping : jump
jumped : jump
jump : jump
studi


### Lemmatization

In [5]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
lm = WordNetLemmatizer()
print(lm.lemmatize('studies'))

[nltk_data] Downloading package wordnet to /root/nltk_data...


study


### Stopwords Removal

In [6]:
from nltk.corpus import stopwords
nltk.download('stopwords')
text = "now we will perform stopwords removal. this is a part of noise removal from data."
text_tokens = nltk.word_tokenize(text)
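stop_words = set(stopwords.words('english'))  # English only; build the set once instead of on every iteration
text_final = [word for word in text_tokens if word not in stop_words]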
print(text_final)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


['perform', 'stopwords', 'removal', '.', 'part', 'noise', 'removal', 'data', '.']
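
The output above still contains the `'.'` tokens, because punctuation is not part of the stopword list. A minimal sketch of one way to drop them, assuming that any token made up entirely of `string.punctuation` characters counts as noise (the variable names are illustrative):

```python
import string

# Tokens as printed by the stopword-removal cell above.
tokens = ['perform', 'stopwords', 'removal', '.', 'part', 'noise', 'removal', 'data', '.']

# Assumption: a token consisting only of punctuation characters is noise.
punctuation = set(string.punctuation)
words_only = [tok for tok in tokens if not all(ch in punctuation for ch in tok)]
print(words_only)
```

The same filter could be applied before building the frequency tables in the summarization section, where `','` and `'.'` are otherwise counted as words.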


# Word Embeddings and Word Vectorization

### Count Vectorizer

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
corpus = ["Today is Tuesday.", "Tommorrow is not and will not be Tuesday", "Day after Tommorrow is Thursday."]
cv = CountVectorizer()
res = cv.fit_transform(corpus)
print(res.toarray())

[[0 0 0 0 1 0 0 1 0 1 0]
 [0 1 1 0 1 2 0 0 1 1 1]
 [1 0 0 1 1 0 1 0 1 0 0]]
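
To read the matrix above, each column corresponds to one vocabulary term in alphabetical order. The sketch below rebuilds the same counts in plain Python, assuming `CountVectorizer`'s default behaviour of lowercasing the text and keeping tokens of two or more word characters; the helper `tokenize` is a hypothetical name, not part of scikit-learn.

```python
import re
from collections import Counter

corpus = ["Today is Tuesday.", "Tommorrow is not and will not be Tuesday", "Day after Tommorrow is Thursday."]

def tokenize(doc):
    # Assumption: lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", doc.lower())

counts = [Counter(tokenize(doc)) for doc in corpus]
vocabulary = sorted(set().union(*counts))   # alphabetical order gives the column order
matrix = [[c[term] for term in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

In the notebook itself, `cv.get_feature_names_out()` returns the column labels directly.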


### TF-IDF Vectorizer

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
tv = TfidfVectorizer()
vectors = tv.fit_transform(corpus)
feature_names = tv.get_feature_names_out()
print("Feature names : ", feature_names)
df = pd.DataFrame(vectors.toarray(), columns=feature_names)   # dense array with one column per term
print(df)

Feature names :  ['after' 'and' 'be' 'day' 'is' 'not' 'thursday' 'today' 'tommorrow'
 'tuesday' 'will']
      after       and        be       day        is       not  thursday  \
0  0.000000  0.000000  0.000000  0.000000  0.425441  0.000000  0.000000   
1  0.000000  0.342884  0.342884  0.000000  0.202513  0.685767  0.000000   
2  0.504611  0.000000  0.000000  0.504611  0.298032  0.000000  0.504611   

      today  tommorrow   tuesday      will  
0  0.720333   0.000000  0.547832  0.000000  
1  0.000000   0.260772  0.260772  0.342884  
2  0.000000   0.383770  0.000000  0.000000  
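
The values in row 0 can be checked by hand. The sketch below assumes `TfidfVectorizer`'s default settings: a smoothed IDF of `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation of each row. The term counts and document frequencies are read off the count matrix in the previous section.

```python
import math

n_docs = 3
# Counts in "Today is Tuesday." and document frequencies across the corpus.
term_counts = {"is": 1, "today": 1, "tuesday": 1}
doc_freq = {"is": 3, "today": 1, "tuesday": 2}

# Smoothed IDF (assumed default): ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
raw = {t: term_counts[t] * idf[t] for t in term_counts}

# L2-normalise the row so that its Euclidean length is 1.
norm = math.sqrt(sum(v * v for v in raw.values()))
row = {t: v / norm for t, v in raw.items()}
print({t: round(v, 6) for t, v in row.items()})
```

The result should agree with the `is`, `today` and `tuesday` entries of row 0 up to rounding. `today` scores highest because it appears in only one document.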


# Text Summarization

In [108]:
import nltk
nltk.download('punkt')
data = """
In the intricate dance of industrial progress and the pursuit of modern conveniences, humanity finds itself ensnared in a web of pollutants that stretches far beyond the immediate horizon. Pollution, in its myriad forms, has emerged as a defining challenge of our era, weaving its tendrils through the air we breathe, the water we rely on, and the delicate ecosystems that sustain life on Earth.

At the heart of this environmental conundrum lies the combustion of fossil fuels—a practice synonymous with the ascent of industrialization. Power plants, factories, and vehicles collectively emit a cocktail of pollutants, including carbon dioxide, sulfur dioxide, and nitrogen oxides, into the atmosphere. This atmospheric quilt of pollutants not only contributes to climate change but also poses direct threats to human health, linking air pollution to respiratory diseases, cardiovascular issues, and a host of other ailments.

The specter of pollution extends beneath the Earth's surface and across its waterways. Rivers, once pristine arteries of life, bear witness to the callous dumping of industrial effluents, agricultural runoff, and untreated sewage. The cumulative impact is a stark transformation of once-clear waters into polluted streams, adversely affecting aquatic life and jeopardizing the safety of water supplies for human communities.

In the realm of solid waste, the proliferation of plastics stands as a testament to the convenience-driven choices of modern society. Single-use plastics, designed for transience, linger in the environment for centuries, accumulating in landfills and infiltrating oceans. Marine life becomes entangled in plastic debris, and the ocean's depths are marred by floating islands of refuse. Microplastics, imperceptible to the human eye, permeate ecosystems, raising concerns about their potential impact on wildlife and human health.

Furthermore, pollution is not confined to physical manifestations alone; it infiltrates the very fabric of ecosystems, disrupting biodiversity and triggering a chain reaction of ecological consequences. Pesticides, a byproduct of agricultural intensification, seep into soil and water, affecting not only targeted pests but also non-target species, including beneficial insects and pollinators crucial for food production. The intricate dance of predator-prey relationships is disrupted, setting off a cascade that reverberates across trophic levels.

The silent victims of pollution are not solely the charismatic megafauna but also the humble organisms that form the foundation of ecosystems. Coral reefs, the vibrant underwater cities teeming with biodiversity, face the dual onslaught of warming oceans and chemical pollutants. Coral bleaching, a phenomenon triggered by elevated sea temperatures, threatens these underwater marvels, highlighting the interconnectedness of environmental challenges.

However, within this environmental crisis lies an opportunity for transformative change. The global community, armed with the knowledge of our environmental impact, is gradually shifting towards sustainable alternatives, renewable energy sources, and circular economies. Governments, industries, and individuals are increasingly recognizing the imperative of adopting eco-friendly practices and embracing a regenerative relationship with our planet.

The battle against pollution is not only a fight for the health of our ecosystems but a testament to our shared responsibility as custodians of this delicate blue planet. Governments must enact and enforce stringent environmental regulations, industries must innovate and adopt sustainable practices, and individuals must make conscious choices in their daily lives to reduce their ecological footprint.

Environmental stewardship is no longer a niche concern but a mainstream imperative that transcends borders and ideologies. The unraveling tapestry of pollution demands a global commitment to forge a path toward a cleaner, healthier, and harmonious coexistence with nature. In this collective endeavor, the preservation of our planet's ecological integrity rests, and only through unified action can we hope to secure a sustainable future for generations to come.
"""

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [109]:
sentences = nltk.sent_tokenize(data)
size_doc = len(sentences)

In [110]:
fMatrix = {}   # frequency matrix: for each sentence, the count of every token that is not a stopword (punctuation included)
sw = set(stopwords.words("english"))
ps = PorterStemmer()

for t in sentences:
    sent = {}
    words = nltk.word_tokenize(t)
    for word in words:
        word = word.lower()
        # word = ps.stem(word)
        if word in sw:
            continue

        if word in sent:
            sent[word] += 1
        else:
            sent[word] = 1

    # Key each table by the sentence's first 20 characters (assumes these prefixes are unique)
    fMatrix[t[:20]] = sent

keys = list(fMatrix.keys())
print(keys)
values = list(fMatrix.values())
items = list(fMatrix.items())
print(values)
print(len(keys))
print(len(values))


['\nIn the intricate da', 'Pollution, in its my', 'At the heart of this', 'Power plants, factor', 'This atmospheric qui', 'The specter of pollu', 'Rivers, once pristin', 'The cumulative impac', 'In the realm of soli', 'Single-use plastics,', 'Marine life becomes ', 'Microplastics, imper', 'Furthermore, polluti', 'Pesticides, a byprod', 'The intricate dance ', 'The silent victims o', 'Coral reefs, the vib', 'Coral bleaching, a p', 'However, within this', 'The global community', 'Governments, industr', 'The battle against p', 'Governments must ena', 'Environmental stewar', 'The unraveling tapes', 'In this collective e']
[{'intricate': 1, 'dance': 1, 'industrial': 1, 'progress': 1, 'pursuit': 1, 'modern': 1, 'conveniences': 1, ',': 1, 'humanity': 1, 'finds': 1, 'ensnared': 1, 'web': 1, 'pollutants': 1, 'stretches': 1, 'far': 1, 'beyond': 1, 'immediate': 1, 'horizon': 1, '.': 1}, {'pollution': 1, ',': 5, 'myriad': 1, 'forms': 1, 'emerged': 1, 'defining': 1, 'challenge': 1, 'era': 1, 'weavi

In [111]:
tfMatrix = {}   # TF matrix: (occurrences of a term in a sentence) / (number of distinct non-stopword terms in that sentence)
for sent, fTable in fMatrix.items():
    tf_table = {}
    words_in_sent = len(fTable)   # distinct terms, not the total token count
    for word, occurrence in fTable.items():
        tf_table[word] = occurrence / words_in_sent
    tfMatrix[sent] = tf_table

keys = list(tfMatrix.keys())
print(keys)
values = list(tfMatrix.values())
items = list(tfMatrix.items())
print(values)
print(len(keys))
print(len(values))

['\nIn the intricate da', 'Pollution, in its my', 'At the heart of this', 'Power plants, factor', 'This atmospheric qui', 'The specter of pollu', 'Rivers, once pristin', 'The cumulative impac', 'In the realm of soli', 'Single-use plastics,', 'Marine life becomes ', 'Microplastics, imper', 'Furthermore, polluti', 'Pesticides, a byprod', 'The intricate dance ', 'The silent victims o', 'Coral reefs, the vib', 'Coral bleaching, a p', 'However, within this', 'The global community', 'Governments, industr', 'The battle against p', 'Governments must ena', 'Environmental stewar', 'The unraveling tapes', 'In this collective e']
[{'intricate': 0.05263157894736842, 'dance': 0.05263157894736842, 'industrial': 0.05263157894736842, 'progress': 0.05263157894736842, 'pursuit': 0.05263157894736842, 'modern': 0.05263157894736842, 'conveniences': 0.05263157894736842, ',': 0.05263157894736842, 'humanity': 0.05263157894736842, 'finds': 0.05263157894736842, 'ensnared': 0.05263157894736842, 'web': 0.052631578

In [122]:
# how many sentences contain a word
doc_per_words_mat = {}
for sent, fTable in fMatrix.items():
    for word in fTable.keys():
        if word in doc_per_words_mat:
            doc_per_words_mat[word] += 1
        else:
            doc_per_words_mat[word] = 1

print(doc_per_words_mat)

{'intricate': 2, 'dance': 2, 'industrial': 2, 'progress': 1, 'pursuit': 1, 'modern': 2, 'conveniences': 1, ',': 21, 'humanity': 1, 'finds': 1, 'ensnared': 1, 'web': 1, 'pollutants': 4, 'stretches': 1, 'far': 1, 'beyond': 1, 'immediate': 1, 'horizon': 1, '.': 26, 'pollution': 7, 'myriad': 1, 'forms': 1, 'emerged': 1, 'defining': 1, 'challenge': 1, 'era': 1, 'weaving': 1, 'tendrils': 1, 'air': 2, 'breathe': 1, 'water': 3, 'rely': 1, 'delicate': 2, 'ecosystems': 5, 'sustain': 1, 'life': 4, 'earth': 2, 'heart': 1, 'environmental': 6, 'conundrum': 1, 'lies': 2, 'combustion': 1, 'fossil': 1, 'fuels—a': 1, 'practice': 1, 'synonymous': 1, 'ascent': 1, 'industrialization': 1, 'power': 1, 'plants': 1, 'factories': 1, 'vehicles': 1, 'collectively': 1, 'emit': 1, 'cocktail': 1, 'including': 2, 'carbon': 1, 'dioxide': 1, 'sulfur': 1, 'nitrogen': 1, 'oxides': 1, 'atmosphere': 1, 'atmospheric': 1, 'quilt': 1, 'contributes': 1, 'climate': 1, 'change': 2, 'also': 3, 'poses': 1, 'direct': 1, 'threats': 

In [113]:
import math
idfMatrix = {}   # IDF matrix: log10(total number of sentences / number of sentences containing the term)

for sent, fTable in fMatrix.items():
    idfTable = {}
    for word in fTable.keys():
        idfTable[word] = math.log10(size_doc / float(doc_per_words_mat[word]))
    idfMatrix[sent] = idfTable

keys = list(idfMatrix.keys())
print(keys)
values = list(idfMatrix.values())
items = list(idfMatrix.items())
print(values)
print(len(keys))
print(len(values))

['\nIn the intricate da', 'Pollution, in its my', 'At the heart of this', 'Power plants, factor', 'This atmospheric qui', 'The specter of pollu', 'Rivers, once pristin', 'The cumulative impac', 'In the realm of soli', 'Single-use plastics,', 'Marine life becomes ', 'Microplastics, imper', 'Furthermore, polluti', 'Pesticides, a byprod', 'The intricate dance ', 'The silent victims o', 'Coral reefs, the vib', 'Coral bleaching, a p', 'However, within this', 'The global community', 'Governments, industr', 'The battle against p', 'Governments must ena', 'Environmental stewar', 'The unraveling tapes', 'In this collective e']
[{'intricate': 1.1139433523068367, 'dance': 1.1139433523068367, 'industrial': 1.1139433523068367, 'progress': 1.414973347970818, 'pursuit': 1.414973347970818, 'modern': 1.1139433523068367, 'conveniences': 1.414973347970818, ',': 0.09275405323689871, 'humanity': 1.414973347970818, 'finds': 1.414973347970818, 'ensnared': 1.414973347970818, 'web': 1.414973347970818, 'polluta

In [114]:
tfidfMatrix = {}   # TF-IDF matrix: tf * idf for every term of every sentence
for sent, tf_table in tfMatrix.items():
    idf_table = idfMatrix[sent]   # look up by key instead of relying on matching dict order
    tfidfMatrix[sent] = {word: tf * idf_table[word] for word, tf in tf_table.items()}

print(tfidfMatrix)

{'\nIn the intricate da': {'intricate': 0.05862859748983351, 'dance': 0.05862859748983351, 'industrial': 0.05862859748983351, 'progress': 0.07447228147214831, 'pursuit': 0.07447228147214831, 'modern': 0.05862859748983351, 'conveniences': 0.07447228147214831, ',': 0.004881792275626248, 'humanity': 0.07447228147214831, 'finds': 0.07447228147214831, 'ensnared': 0.07447228147214831, 'web': 0.07447228147214831, 'pollutants': 0.042784913507518715, 'stretches': 0.07447228147214831, 'far': 0.07447228147214831, 'beyond': 0.07447228147214831, 'immediate': 0.07447228147214831, 'horizon': 0.07447228147214831, '.': 0.0}, 'Pollution, in its my': {'pollution': 0.02849376539782806, ',': 0.023188513309224678, 'myriad': 0.0707486673985409, 'forms': 0.0707486673985409, 'emerged': 0.0707486673985409, 'defining': 0.0707486673985409, 'challenge': 0.0707486673985409, 'era': 0.0707486673985409, 'weaving': 0.0707486673985409, 'tendrils': 0.0707486673985409, 'air': 0.05569716761534184, 'breathe': 0.070748667398
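
As a worked check of the printed values, the sketch below recomputes the TF-IDF of `'intricate'` in the first sentence. The inputs are read off the outputs above: 26 sentences in total, 19 distinct non-stopword tokens in the first sentence, one occurrence of the word there, and two sentences containing it. The variable names are illustrative.

```python
import math

size_doc = 26             # number of sentences
distinct_terms = 19       # distinct non-stopword tokens in the first sentence
occurrences = 1           # 'intricate' appears once in that sentence
sentences_with_term = 2   # 'intricate' appears in two sentences

tf = occurrences / distinct_terms
idf = math.log10(size_doc / sentences_with_term)
print(tf, idf, tf * idf)
```

A token that occurs in every sentence, such as `'.'`, gets an IDF of `log10(26 / 26) = 0`, which is why its TF-IDF is `0.0` in the output.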

In [123]:
sentValue = {}   # score of each sentence: mean TF-IDF of its distinct terms
for sent, fTable in tfidfMatrix.items():
    score_per_sent = 0
    for word, score in fTable.items():
        score_per_sent += score  # add the TF-IDF value of every non-stopword term

    sentValue[sent] = score_per_sent / len(fTable)  # divide by the number of distinct terms in the sentence

print(sentValue)

{'\nIn the intricate da': 0.06188676175832941, 'Pollution, in its my': 0.05601737813067344, 'At the heart of this': 0.10059392765021787, 'Power plants, factor': 0.07713843564000879, 'This atmospheric qui': 0.05011743039748337, 'The specter of pollu': 0.10810480845675481, 'Rivers, once pristin': 0.07412704336673095, 'The cumulative impac': 0.05806955141390172, 'In the realm of soli': 0.08552272720863889, 'Single-use plastics,': 0.0901828934678209, 'Marine life becomes ': 0.07736989480359956, 'Microplastics, imper': 0.08196663602100336, 'Furthermore, polluti': 0.06071422639125702, 'Pesticides, a byprod': 0.055399071644489005, 'The intricate dance ': 0.08730396978653227, 'The silent victims o': 0.08851178033479218, 'Coral reefs, the vib': 0.07105060548469282, 'Coral bleaching, a p': 0.07307686636558805, 'However, within this': 0.10032329595291838, 'The global community': 0.06474433958331399, 'Governments, industr': 0.07376840176468827, 'The battle against p': 0.08212609248835653, 'Governm

In [116]:
# Threshold: the average of all sentence scores (computed without shadowing the built-in sum)
threshold = sum(sentValue.values()) / len(sentValue)

print(threshold)

0.07615671916203157


In [124]:
summary = ''

for sentence in sentences:
    if sentence[:20] in sentValue and sentValue[sentence[:20]] >= threshold:    # keep sentences scoring at or above the average
        summary += " " + sentence

print(summary)

 At the heart of this environmental conundrum lies the combustion of fossil fuels—a practice synonymous with the ascent of industrialization. The specter of pollution extends beneath the Earth's surface and across its waterways. In the realm of solid waste, the proliferation of plastics stands as a testament to the convenience-driven choices of modern society. Single-use plastics, designed for transience, linger in the environment for centuries, accumulating in landfills and infiltrating oceans. The intricate dance of predator-prey relationships is disrupted, setting off a cascade that reverberates across trophic levels. The silent victims of pollution are not solely the charismatic megafauna but also the humble organisms that form the foundation of ecosystems. However, within this environmental crisis lies an opportunity for transformative change. Environmental stewardship is no longer a niche concern but a mainstream imperative that transcends borders and ideologies.
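
The cells above key every table by the first 20 characters of a sentence, so two sentences sharing that prefix would overwrite each other. The sketch below applies the same scoring but indexes sentences by position. The function name `summarize`, the regex tokenizer, and the tiny stopword set are assumptions made to keep the example free of NLTK downloads; it is an illustration of the approach, not a replacement for the notebook's pipeline.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately small stopword set; swap in stopwords.words("english").
STOPWORDS = {"a", "an", "the", "is", "are", "of", "in", "and", "to", "it", "on"}

def summarize(sentences, stopwords=STOPWORDS):
    """Return the sentences whose mean TF-IDF score is at least the average score."""
    if not sentences:
        return []
    # One Counter of non-stopword tokens per sentence, indexed by position.
    freqs = [
        Counter(w for w in re.findall(r"\w+", s.lower()) if w not in stopwords)
        for s in sentences
    ]
    n = len(sentences)
    # Number of sentences in which each word occurs.
    sent_freq = Counter(word for f in freqs for word in f)

    scores = []
    for f in freqs:
        if not f:                  # sentence made up only of stopwords
            scores.append(0.0)
            continue
        distinct = len(f)          # same denominator as the notebook's TF
        total = sum((count / distinct) * math.log10(n / sent_freq[word])
                    for word, count in f.items())
        scores.append(total / distinct)

    threshold = sum(scores) / n
    return [s for s, score in zip(sentences, scores) if score >= threshold]

demo = ["The cat sat on the mat.", "The cat sat on the mat.", "Dogs bark loudly at night."]
print(summarize(demo))
```

With the notebook's data, `summarize(nltk.sent_tokenize(data), set(stopwords.words("english")))` would be the analogous call, though the regex tokenizer splits words differently from `nltk.word_tokenize`, so the scores need not match exactly.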
