<a href="https://colab.research.google.com/github/sumant1122/PythonBootcamp/blob/main/Session4_PythonBootcamp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP: Summarization

Import required libraries

In [11]:
import pandas as pd
import numpy as np
import nltk
nltk.download('punkt')
import re
from nltk.corpus import stopwords

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Read the file

In [12]:
df = pd.read_csv("tennis.csv")
df.head()

| Unnamed: 0 | article_id | article_text | source |
|---|---|---|---|
| 0 | 1 | Maria Sharapova has basically no friends as te... | https://www.tennisworldusa.org/tennis/news/Mar... |
| 1 | 2 | BASEL, Switzerland (AP), Roger Federer advance... | http://www.tennis.com/pro-game/2018/10/copil-s... |
| 2 | 3 | Roger Federer has revealed that organisers of ... | https://scroll.in/field/899938/tennis-roger-fe... |
| 3 | 4 | Kei Nishikori will try to end his long losing ... | http://www.tennis.com/pro-game/2018/10/nishiko... |
| 4 | 5 | Federer, 37, first broke through on tour over ... | https://www.express.co.uk/sport/tennis/1036101... |


In [13]:
df['article_text'][1]

"BASEL, Switzerland (AP), Roger Federer advanced to the 14th Swiss Indoors final of his career by beating seventh-seeded Daniil Medvedev 6-1, 6-4 on Saturday. Seeking a ninth title at his hometown event, and a 99th overall, Federer will play 93th-ranked Marius Copil on Sunday. Federer dominated the 20th-ranked Medvedev and had his first match-point chance to break serve again at 5-1. He then dropped his serve to love, and let another match point slip in Medvedev's next service game by netting a backhand. He clinched on his fourth chance when Medvedev netted from the baseline. Copil upset expectations of a Federer final against Alexander Zverev in a 6-3, 6-7 (6), 6-4 win over the fifth-ranked German in the earlier semifinal. The Romanian aims for a first title after arriving at Basel without a career win over a top-10 opponent. Copil has two after also beating No. 6 Marin Cilic in the second round. Copil fired 26 aces past Zverev and never dropped serve, clinching after 2 1/2 hours with

Split the articles into sentences

In [15]:
nltk.download('punkt_tab')
nltk.download('stopwords')
from nltk.tokenize import sent_tokenize
sentences = []
for s in df['article_text']:
    sentences.append(sent_tokenize(s))

sentences = [y for x in sentences for y in x]

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
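
To see what the tokenization cell produces without downloading NLTK data, here is a rough sketch that swaps `sent_tokenize` for a naive regular-expression splitter. The helper `naive_sent_tokenize` and the two sample articles are made up for illustration; the flattening step is the same as in the cell above.

```python
import re

def naive_sent_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace (a rough stand-in)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

# Hypothetical sample articles, not taken from tennis.csv.
articles = [
    "Federer won in Basel. He plays Copil on Sunday.",
    "Nishikori reached the final! Can he end his losing streak?",
]

# One list of sentences per article, then flattened into a single list.
nested = [naive_sent_tokenize(a) for a in articles]
sentences_demo = [s for group in nested for s in group]
print(sentences_demo)

# The naive rule breaks on abbreviations, which is why a trained tokenizer is used.
print(naive_sent_tokenize("beating No. 6 Marin Cilic."))
```

The second call splits after "No.", which shows the kind of mistake the Punkt model behind `sent_tokenize` is designed to avoid.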


## GloVe Word Representation

GloVe (Global Vectors for Word Representation) is an unsupervised learning algorithm that produces word embeddings from global word-word co-occurrence statistics gathered over a large text corpus. It first builds a co-occurrence matrix in which each entry records how often a pair of words appears together within a fixed context window. It then fits low-dimensional word vectors whose dot products approximate the logarithm of those counts, which amounts to a weighted factorization of the matrix. The resulting vectors capture both semantic and syntactic relationships between words.
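
The co-occurrence counting step can be sketched in a few lines. This is only an illustration on a made-up token list: the function name `cooccurrence_counts` and the `window` parameter are assumptions, and the sketch counts every pair equally, whereas the real GloVe training pipeline is more involved.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count how often each ordered (word, context) pair occurs within `window` tokens."""
    counts = Counter()
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts

# Hypothetical toy corpus.
tokens = "federer beat medvedev and federer beat copil".split()
counts = cooccurrence_counts(tokens, window=1)

print(counts[("federer", "beat")])   # pairs seen next to each other
print(counts[("federer", "copil")])  # never within one token of each other
```

Each key of `counts` corresponds to one cell of the co-occurrence matrix; words that never appear near each other simply have a count of zero.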



In [16]:
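# Load the pre-trained GloVe vectors: each line holds a word followed by its 100 coefficients.
word_embeddings = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = coefs

# Replace everything except letters with spaces. regex=True is required because
# pandas 2.0+ treats the pattern as a literal string by default.
clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ", regex=True)
clean_sentences = [s.lower() for s in clean_sentences]

# Drop English stop words from each sentence.
stop_words = stopwords.words('english')
def remove_stopwords(sen):
    sen_new = " ".join([i for i in sen if i not in stop_words])
    return sen_new
clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]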
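
To make the effect of the cleaning steps concrete, here is a small dependency-free sketch of the same idea applied to one sentence. The stop-word set `DEMO_STOP_WORDS` is a tiny hand-picked assumption standing in for NLTK's English list, and `clean_sentence` is a hypothetical helper, not part of the notebook.

```python
import re

# Tiny hand-picked stop-word set; an assumption standing in for NLTK's English list.
DEMO_STOP_WORDS = {"the", "to", "of", "his", "by", "on", "a"}

def clean_sentence(sentence, stop_words=DEMO_STOP_WORDS):
    """Keep letters only, lowercase, and drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", sentence).lower()
    return " ".join(w for w in letters_only.split() if w not in stop_words)

print(clean_sentence("Federer advanced to the 14th Swiss Indoors final, 6-1, 6-4."))
```

Note that stripping digits leaves the fragment "th" from "14th" behind; such leftovers usually have no useful embedding, so they add little to the sentence vector.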

Build a vector for each sentence by averaging the GloVe vectors of its words

In [17]:
sentence_vectors = []
for i in clean_sentences:
  if len(i) != 0:
    v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.001)
  else:
    v = np.zeros((100,))
  sentence_vectors.append(v)
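
The averaging step can be checked by hand on a toy example. The sketch below uses made-up 3-dimensional embeddings and plain Python lists instead of the 100-dimensional GloVe arrays; `sentence_vector` and `toy_embeddings` are hypothetical names, and the sketch divides by the exact word count rather than the count plus 0.001 used in the cell above.

```python
def sentence_vector(sentence, embeddings, dim):
    """Average the vectors of the words in `sentence`; unknown words count as zeros."""
    words = sentence.split()
    if not words:
        return [0.0] * dim
    total = [0.0] * dim
    for w in words:
        vec = embeddings.get(w, [0.0] * dim)
        total = [t + v for t, v in zip(total, vec)]
    return [t / len(words) for t in total]

# Hypothetical 3-dimensional embeddings for two words.
toy_embeddings = {"federer": [1.0, 0.0, 2.0], "wins": [3.0, 2.0, 0.0]}

print(sentence_vector("federer wins", toy_embeddings, dim=3))
# An out-of-vocabulary word still counts in the denominator and shrinks the mean.
print(sentence_vector("federer wins easily", toy_embeddings, dim=3))
```

Because unknown words contribute zeros, sentences with many out-of-vocabulary tokens end up with shorter vectors; cosine similarity ignores vector length, so this matters less than it might seem.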

Compute the cosine similarity between every pair of sentences

In [18]:
from sklearn.metrics.pairwise import cosine_similarity

# Pairwise cosine similarity between sentence vectors; the diagonal stays 0.
sim_mat = np.zeros([len(sentences), len(sentences)])
for i in range(len(sentences)):
  for j in range(len(sentences)):
    if i != j:
      sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,100), sentence_vectors[j].reshape(1,100))[0,0]
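
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The following is a minimal pure-Python sketch of that formula on toy vectors; the helper name `cosine` is an assumption, and returning 0.0 for an all-zero vector is a choice made for this sketch.

```python
import math

def cosine(u, v):
    """cos(u, v) = (u . v) / (|u| |v|); returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [2.0, 0.0]))  # same direction
print(cosine([1.0, 0.0], [0.0, 3.0]))  # orthogonal directions
```

Two vectors pointing the same way score 1 regardless of their lengths, and orthogonal vectors score 0, which is why the measure suits averaged sentence vectors of different magnitudes.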

Create a graph from the similarity matrix and rank the sentences with PageRank

In [19]:
import networkx as nx

nx_graph = nx.from_numpy_array(sim_mat)
scores = nx.pagerank(nx_graph)
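
To show roughly what the ranking step computes, here is a simplified power-iteration sketch of PageRank on a weighted similarity matrix. It is not NetworkX's implementation: the function `pagerank`, its parameters, and the 3x3 matrix `toy_sim` are assumptions for illustration, with the damping factor set to 0.85 to match the default `alpha` of `nx.pagerank`.

```python
def pagerank(weights, damping=0.85, tol=1e-10, max_iter=200):
    """Power iteration on a weighted adjacency matrix given as a list of rows."""
    n = len(weights)
    out_sums = [sum(row) for row in weights]
    ranks = [1.0 / n] * n
    for _ in range(max_iter):
        # Rank held by nodes with no outgoing weight is spread evenly.
        dangling = sum(ranks[i] for i in range(n) if out_sums[i] == 0)
        new_ranks = []
        for j in range(n):
            incoming = sum(
                ranks[i] * weights[i][j] / out_sums[i]
                for i in range(n) if out_sums[i] > 0
            )
            new_ranks.append((1 - damping) / n + damping * (incoming + dangling / n))
        converged = sum(abs(a - b) for a, b in zip(new_ranks, ranks)) < tol
        ranks = new_ranks
        if converged:
            break
    return ranks

# Hypothetical similarities: sentence 0 is close to both other sentences.
toy_sim = [
    [0.0, 0.9, 0.8],
    [0.9, 0.0, 0.1],
    [0.8, 0.1, 0.0],
]
toy_scores = pagerank(toy_sim)
print(toy_scores)
```

The sentence that is most similar to the others (index 0 here) receives the highest score, which is the intuition behind using PageRank for extractive summarization.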

## Output

In [20]:
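# Note: sentences are ranked across the whole corpus, so ranked_sentences[i]
# is the i-th best sentence overall, not necessarily a sentence from article i.
ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)
for i in range(5):
  print("ARTICLE:")
  print(df['article_text'][i])
  print('\n')
  print("SUMMARY:")
  print(ranked_sentences[i][1])
  print('\n')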

ARTICLE:
Maria Sharapova has basically no friends as tennis players on the WTA Tour. The Russian player has no problems in openly speaking about it and in a recent interview she said: 'I don't really hide any feelings too much. I think everyone knows this is my job here. When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match. I'm a pretty competitive girl. I say my hellos, but I'm not sending any players flowers as well. Uhm, I'm not really friendly or close to many players. I have not a lot of friends away from the courts.' When she said she is not really close to a lot of players, is that something strategic that she is doing? Is it different on the men's tour than the women's tour? 'No, not at all. I think just because you're in t
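
Because the ranking is computed over all sentences in the corpus, a per-article summary needs to remember which article each sentence came from. One possible approach is sketched below on toy data; `summarize_per_article`, the `(article_index, sentence)` record layout, and the scores are all hypothetical and not part of the notebook.

```python
from collections import defaultdict

def summarize_per_article(sentence_records, scores, top_n=1):
    """sentence_records: list of (article_index, sentence); scores: one score per record."""
    by_article = defaultdict(list)
    for position, ((article_idx, sentence), score) in enumerate(zip(sentence_records, scores)):
        by_article[article_idx].append((score, position, sentence))
    summaries = {}
    for article_idx, items in by_article.items():
        best = sorted(items, key=lambda t: t[0], reverse=True)[:top_n]
        # Restore original sentence order so the summary reads naturally.
        best.sort(key=lambda t: t[1])
        summaries[article_idx] = " ".join(sentence for _, _, sentence in best)
    return summaries

# Made-up sentences and scores for two articles.
records = [
    (0, "Sharapova spoke about rivals."),
    (0, "She keeps her distance on tour."),
    (1, "Federer reached the final."),
    (1, "He plays Copil on Sunday."),
]
demo_scores = [0.20, 0.30, 0.35, 0.15]
print(summarize_per_article(records, demo_scores))
```

In the notebook itself, the records could be built during tokenization by pairing each sentence with the index of the article it was split from.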

# Hugging Face Summarization


In [23]:
text_example = '''The tower is 324 meters (1,063 ft) tall, about the same height
as an 81-storey building, and the tallest structure in Paris. Its base is square,
measuring 125 meters (410 ft) on each side. During its construction, the Eiffel
Tower surpassed the Washington Monument to become the tallest man-made structure
in the world, a title it held for 41 years until the Chrysler Building in New York
City was finished in 1930. It was the first structure to reach a height of 300 meters.
Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is
now taller than the Chrysler Building by 5.2 meters (17 ft). Excluding transmitters,
the Eiffel Tower is the second tallest free-standing structure in France
after the Millau Viaduct.'''

In [None]:
from transformers import pipeline
summarizer = pipeline("summarization", model = "google/pegasus-cnn_dailymail")
summarizer(text_example)

In [27]:
from transformers import pipeline
summarizer = pipeline("summarization", model = "facebook/bart-large-cnn")
summarizer(text_example)

Device set to use cuda:0


[{'summary_text': 'The tower is 324 meters (1,063 ft) tall, about the same height as an 81-storey building. Its base is square,  \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0measuring 125 meters (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world.'}]
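
The pipeline returns a list with one dictionary per input, and the summary above contains runs of non-breaking spaces (`\xa0`). A small post-processing sketch, assuming only that output shape, could look like this; `tidy_summary` is a hypothetical helper and `result` is a shortened copy of the output shown above.

```python
# Shortened copy of the pipeline output shape: a list with one dict per input text.
result = [{"summary_text": "Its base is square,  \xa0 \xa0measuring 125 meters (410 ft) on each side."}]

def tidy_summary(pipeline_output):
    """Return the first summary with runs of whitespace collapsed to single spaces."""
    text = pipeline_output[0]["summary_text"]
    # str.split() with no arguments also splits on non-breaking spaces.
    return " ".join(text.split())

print(tidy_summary(result))
```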

# Classification

In [2]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classifier(
    "There is a sale in supermarket",
    candidate_labels=["education", "politics", "business"],
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


{'sequence': 'There is a sale in supermarket',
 'labels': ['business', 'education', 'politics'],
 'scores': [0.9664040803909302, 0.017405562102794647, 0.01619039662182331]}
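
The result pairs each candidate label with a score. As a small sketch of how the prediction might be consumed, the helper below picks the best label and rejects it when the score falls under a threshold; `top_label` and its `min_score` parameter are assumptions, and the dictionary is copied from the output above.

```python
zero_shot_output = {
    "sequence": "There is a sale in supermarket",
    "labels": ["business", "education", "politics"],
    "scores": [0.9664040803909302, 0.017405562102794647, 0.01619039662182331],
}

def top_label(output, min_score=0.5):
    """Return (label, score) for the best label, or None if it scores below `min_score`."""
    label, score = max(zip(output["labels"], output["scores"]), key=lambda pair: pair[1])
    return (label, score) if score >= min_score else None

print(top_label(zero_shot_output))
print(top_label(zero_shot_output, min_score=0.99))
```

A threshold like this is useful when none of the candidate labels fits the text well, since the scores over the candidates still sum to roughly one in that case.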

In [4]:
pipe = pipeline("text-classification")
pipe("I hate the food here")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


[{'label': 'NEGATIVE', 'score': 0.9994348883628845}]

In [5]:
from transformers import pipeline

classifier = pipeline(model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")
#classifier("This movie is disgustingly good !")

classifier("Director tried too much.")

Device set to use cuda:0


[{'label': 'NEGATIVE', 'score': 0.9963768124580383}]

# Text Generation

In [6]:
from transformers import pipeline

generator = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-360M")
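generator(
    "In this course, we will teach you how to",
    # Note: as the warning in the output shows, the pipeline's default
    # max_new_tokens=256 takes precedence over max_length here; pass
    # max_new_tokens=30 instead to actually cap the generated length.
    max_length=30,
    num_return_sequences=2,
)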

Device set to use cuda:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': "In this course, we will teach you how to make a simple but powerful video game from scratch. You'll learn all the essential components, including movement, interaction, and sound.\n\n**Step 1: Understand Human Interaction Basics**\n\nBefore we begin, let's cover some fundamental concepts:\n\n* **Human Interaction:** Humans interact with each other using verbal (talking) and non-verbal cues (body language, facial expressions).\n* **Communication:** When interacting with humans, remember to speak clearly and listen attentively.\n* **Body Language:** Pay attention to posture, eye contact, and gestures when communicating with others.\n\nNow that we've covered these basics, let's dive into creating our first video game!\n\n**Step 2: Gather Materials**\n\nTo start, you'll need the following materials:\n\n* A computer or tablet\n* Internet access\n* Video editing software or a digital camera (optional)\n* Game development tools (e.g., Adobe After Effects, Unreal Engine)\n