<a href="https://colab.research.google.com/github/yashashridalvi/text_summarization_and_translation/blob/main/language_translation(project).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Text summarization, language detection, language translation, and text-to-speech for a news article**



News Article Summarization: The process of transforming long news articles into brief, comprehensible summaries. It helps readers quickly grasp the key information without reading the entire article.

In [None]:
# First, import the necessary Python libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from wordcloud import WordCloud
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Keras building blocks for deep-learning text classification:
# the Sequential model and the layer types used with it
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dense, Flatten, SimpleRNN, GRU

In [None]:
#Preprocessing
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing import sequence

Beautiful Soup simplifies web scraping by parsing HTML and XML, making it easy to extract and manipulate data from web pages in Python programs.

In [None]:
import requests
from bs4 import BeautifulSoup

In [None]:
# Retrieve page text
url = 'https://www.npr.org/2019/07/10/740387601/university-of-texas-austin-promises-free-tuition-for-low-income-students-in-2020'
news_page = requests.get(url).text

In [None]:
# Turn page into BeautifulSoup object to access HTML tags
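# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(news_page, 'html.parser')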

In [None]:

# Get headline
headline = soup.find('h1').get_text()
print(headline)

University of Texas-Austin Promises Free Tuition For Low-Income Students In 2020


***The main text of the article is surrounded by the "p" tag. This time we’ll have to find all of the "p" tags contained on the page since the paragraphs of the article are each contained in a "p" tag.***


In [None]:
# Get text from all <p> tags.
p_tags = soup.find_all('p')
# Get the text from each of the "p" tags and strip surrounding whitespace.
p_tags_text = [tag.get_text().strip() for tag in p_tags]
p_tags_text

['From',
 'By',
 'Vanessa Romo',
 ',',
 'Claire McInerny',
 'The University of Texas-Austin announced Tuesday it is offering full tuition scholarships to in-state undergraduates whose families make $65,000 or less a year.\n                \n                    \n                    Jon Herskovitz/Reuters\n                    \n                \nhide caption',
 "Four year colleges and universities have difficulty recruiting talented students from the lower end of the economic spectrum who can't afford to attend such institutions without taking on massive debt. To remedy that — at least in part — the University of Texas-Austin announced it is offering full tuition scholarships to in-state undergraduates whose families make $65,000 or less per year.",
 "The University of Texas System Board of Regents voted unanimously on Tuesday to establish a $160 million endowment, drawing from the state's Permanent University Fund to begin the program in the fall of 2020.",
 '"Recognizing both the need

In [None]:
# Filter out sentences that contain newline characters '\n' or don't contain periods.
sentence_list = [sentence for sentence in p_tags_text if '\n' not in sentence]
sentence_list = [sentence for sentence in sentence_list if '.' in sentence]
sentence_list


["Four year colleges and universities have difficulty recruiting talented students from the lower end of the economic spectrum who can't afford to attend such institutions without taking on massive debt. To remedy that — at least in part — the University of Texas-Austin announced it is offering full tuition scholarships to in-state undergraduates whose families make $65,000 or less per year.",
 "The University of Texas System Board of Regents voted unanimously on Tuesday to establish a $160 million endowment, drawing from the state's Permanent University Fund to begin the program in the fall of 2020.",
 '"Recognizing both the need for improved access to higher education and the high value of a UT Austin degree, we are dedicating a distribution from the Permanent University Fund to establish an endowment that will directly benefit students and make their degrees more affordable," Chairman of the Board of Regents Kevin Eltife said after the vote.',
 '"This will benefit students of our gr

In [None]:
# Combine list items into string.
article = ' '.join(sentence_list)
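
The filter-and-join steps above can be tried on a small hand-made list. This is only a sketch: `sample_p_tags_text` is invented for illustration and is not scraped from the page. It shows why byline fragments (no period) and the photo caption (contains a newline) are dropped.

```python
# Hand-made stand-in for p_tags_text; not scraped from the page.
sample_p_tags_text = [
    "By",
    "Vanessa Romo",
    "A photo caption.\n    hide caption",
    "The first full sentence of the article.",
    "A second sentence follows.",
]

# Drop strings containing a newline, then keep only strings containing a period.
no_newlines = [s for s in sample_p_tags_text if "\n" not in s]
sentences = [s for s in no_newlines if "." in s]

# Join the surviving strings with single spaces, as done for `article`.
sample_article = " ".join(sentences)
print(sample_article)
```

Note that the period test is a rough heuristic: a genuine paragraph without a period would also be dropped.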

##**Translate the news article**

In [None]:
# langdetect is used to detect the language of a text
!pip install langdetect



In [None]:
from langdetect import detect_langs
sentence = "Good morning yashashri"  # English sentence, so 'en' should rank first
print(detect_langs(sentence))

[en:0.5714290721216241, so:0.4285706484072208]
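
The output above shows English at only about 0.57, which is common for very short inputs. One way to handle this is to accept a detection only above a confidence threshold. The sketch below is an assumption-laden illustration: `pick_language` and `Candidate` are hypothetical names, and `Candidate` merely mimics the `.lang` and `.prob` attributes of the objects that `detect_langs` returns.

```python
from collections import namedtuple

# Stand-in for the objects returned by detect_langs, which expose .lang and .prob.
Candidate = namedtuple("Candidate", ["lang", "prob"])


def pick_language(candidates, min_prob=0.9):
    """Return the most probable language code, or None if it is below min_prob."""
    if not candidates:
        return None
    best = max(candidates, key=lambda c: c.prob)
    return best.lang if best.prob >= min_prob else None


# Probabilities rounded from the output shown above.
short_text = [Candidate("en", 0.57), Candidate("so", 0.43)]
print(pick_language(short_text))       # not confident enough at the default threshold
print(pick_language(short_text, 0.5))  # accepted with a lower threshold
```

langdetect is also non-deterministic by default; setting `DetectorFactory.seed = 0` (imported from `langdetect`) before detecting makes results repeatable.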


In [None]:
# googletrans is used to translate text from one language to another
#!pip install googletrans==4.0.0-rc1


In [None]:
# googletrans is an unofficial Python client for Google Translate
#!pip install googletrans

In [None]:
# List of supported languages and the codes used to select them
# Example: for Hindi, use 'hi'
#          for English, use 'en'
import googletrans
print(googletrans.LANGUAGES)

{'af': 'afrikaans', 'sq': 'albanian', 'am': 'amharic', 'ar': 'arabic', 'hy': 'armenian', 'az': 'azerbaijani', 'eu': 'basque', 'be': 'belarusian', 'bn': 'bengali', 'bs': 'bosnian', 'bg': 'bulgarian', 'ca': 'catalan', 'ceb': 'cebuano', 'ny': 'chichewa', 'zh-cn': 'chinese (simplified)', 'zh-tw': 'chinese (traditional)', 'co': 'corsican', 'hr': 'croatian', 'cs': 'czech', 'da': 'danish', 'nl': 'dutch', 'en': 'english', 'eo': 'esperanto', 'et': 'estonian', 'tl': 'filipino', 'fi': 'finnish', 'fr': 'french', 'fy': 'frisian', 'gl': 'galician', 'ka': 'georgian', 'de': 'german', 'el': 'greek', 'gu': 'gujarati', 'ht': 'haitian creole', 'ha': 'hausa', 'haw': 'hawaiian', 'iw': 'hebrew', 'he': 'hebrew', 'hi': 'hindi', 'hmn': 'hmong', 'hu': 'hungarian', 'is': 'icelandic', 'ig': 'igbo', 'id': 'indonesian', 'ga': 'irish', 'it': 'italian', 'ja': 'japanese', 'jw': 'javanese', 'kn': 'kannada', 'kk': 'kazakh', 'km': 'khmer', 'ko': 'korean', 'ku': 'kurdish (kurmanji)', 'ky': 'kyrgyz', 'lo': 'lao', 'la': 'lat
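
The mapping printed above can be used to validate what a user types before calling the translator. The sketch below is hypothetical: `resolve_language` is not part of googletrans, and `SAMPLE_LANGUAGES` holds only a few entries copied from the output above. In the notebook, `googletrans.LANGUAGES` could be passed instead.

```python
# A few entries copied from the mapping printed above.
SAMPLE_LANGUAGES = {"en": "english", "fr": "french", "de": "german", "hi": "hindi"}


def resolve_language(user_input, languages=SAMPLE_LANGUAGES):
    """Return a language code for a code or name typed by the user, else None."""
    text = user_input.strip().lower()
    # Accept the code directly, e.g. 'hi'.
    if text in languages:
        return text
    # Otherwise look for a matching language name, e.g. 'hindi'.
    for code, name in languages.items():
        if name == text:
            return code
    return None


print(resolve_language(" HI "))
print(resolve_language("French"))
print(resolve_language("klingon"))
```

Returning `None` for unknown input lets the caller ask again instead of sending an invalid code to the translator.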

In [None]:
from googletrans import Translator

In [None]:
translator = Translator()
#sentence ="hello good morning yashu"
#translate_to_hindi = translator.translate(sentence,dest ='hi')
#print(translate_to_hindi)

In [None]:
# Function to translate a sentence
def translate_sentence(article, target_language):
    # Initialize the Translator
    translator = Translator()

    # Translate the sentence to the target language
    translated_sentence = translator.translate(article, dest=target_language).text

    return translated_sentence

# Sentence to be translated
article = ' '.join(sentence_list)

# Prompt the user for the target language code (e.g., 'es' for Spanish)
target_language = input("Enter the target language (e.g., 'es' for Spanish): ")
# Translate the sentence
translated_sentence = translate_sentence(article, target_language)

# Print the translated sentence
print(f"Original Sentence: {article}")
print(f"Translated Sentence: {translated_sentence}")


Enter the target language (e.g., 'es' for Spanish): hi
Original Sentence: Four year colleges and universities have difficulty recruiting talented students from the lower end of the economic spectrum who can't afford to attend such institutions without taking on massive debt. To remedy that — at least in part — the University of Texas-Austin announced it is offering full tuition scholarships to in-state undergraduates whose families make $65,000 or less per year. The University of Texas System Board of Regents voted unanimously on Tuesday to establish a $160 million endowment, drawing from the state's Permanent University Fund to begin the program in the fall of 2020. "Recognizing both the need for improved access to higher education and the high value of a UT Austin degree, we are dedicating a distribution from the Permanent University Fund to establish an endowment that will directly benefit students and make their degrees more affordable," Chairman of the Board of Regents Kevin Eltif

#**Summarize the News Article**

In [None]:
print(translated_sentence)

चार साल के कॉलेजों और विश्वविद्यालयों को आर्थिक स्पेक्ट्रम के निचले छोर से प्रतिभाशाली छात्रों को भर्ती करने में कठिनाई होती है जो बड़े पैमाने पर ऋण लेने के बिना ऐसे संस्थानों में भाग नहीं ले सकते।यह उपाय करने के लिए कि-कम से कम भाग में-टेक्सास-ऑस्टिन विश्वविद्यालय ने घोषणा की कि यह राज्य के स्नातक को पूर्ण ट्यूशन छात्रवृत्ति की पेशकश कर रहा है, जिनके परिवार प्रति वर्ष $ 65,000 या उससे कम कमाते हैं।यूनिवर्सिटी ऑफ टेक्सास सिस्टम बोर्ड ऑफ रीजेंट्स ने 2020 के पतन में कार्यक्रम शुरू करने के लिए राज्य के स्थायी विश्वविद्यालय फंड से ड्राइंग, $ 160 मिलियन की बंदोबस्ती स्थापित करने के लिए मंगलवार को सर्वसम्मति से मतदान किया।एक यूटी ऑस्टिन की डिग्री का मूल्य, हम एक बंदोबस्ती स्थापित करने के लिए स्थायी विश्वविद्यालय फंड से एक वितरण समर्पित कर रहे हैं जो सीधे छात्रों को लाभान्वित करेगा और अपनी डिग्री को अधिक सस्ती बना देगा, "बोर्ड ऑफ रीजेंट्स केविन एल्टिफ़ के अध्यक्ष केविन एल्टिफ़ ने वोट के बाद कहा।"इससे आने वाले वर्षों के लिए हमारे महान राज्य के छात्रों को लाभ होगा," उन्होंने कहा।बंदोबस्ती-जिसमे

In [None]:
#!pip install transformers

We split the long document into smaller sections on a delimiter (here the newline character '\n'). Because `article` was built by joining newline-free strings with spaces, this split yields a single section for this article.
We use the bert-extractive-summarizer library to summarize each section.
In that library, min_length and max_length set the character length a sentence must have to be considered for the summary; the size of the summary itself is set with the ratio or num_sentences arguments.
Finally, we combine all section summaries into an overall summary of the entire document.
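
The split, summarize, and combine steps can be sketched without loading a model. Everything here is illustrative: `summarize_by_section` is a hypothetical helper, and `first_sentence` is a placeholder standing in for a real summarizer such as the one used in the next cell.

```python
def summarize_by_section(text, summarize, delimiter="\n"):
    """Split text on delimiter, summarize each non-empty section, join the results."""
    sections = [s.strip() for s in text.split(delimiter) if s.strip()]
    return " ".join(summarize(section) for section in sections)


def first_sentence(section):
    """Placeholder summarizer: keep everything up to the first period."""
    head, sep, _ = section.partition(".")
    return head + sep


document = "Alpha one. Alpha two.\n\nBeta one. Beta two."
print(summarize_by_section(document, first_sentence))

# A text with no newline is treated as a single section.
print(summarize_by_section("Gamma one. Gamma two.", first_sentence))
```

Passing the summarizer in as a function keeps the splitting logic independent of the model, so the same helper works with any callable that maps text to text.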

In [89]:
#!pip install bert-extractive-summarizer

In [90]:
from summarizer import Summarizer

# Article text to be summarized (replace with your long document)
article_text = article

# Split the article into smaller sections (e.g., paragraphs)
sections = article_text.split("\n")  # You may choose a different delimiter

# Create a Summarizer object
model = Summarizer()

# Initialize an empty list to store section summaries
section_summaries = []

# Summarize each section and add it to the list
for section in sections:
    if section.strip():  # Skip empty sections
        summary = model(section, min_length=50, max_length=150)  # Only consider sentences of 50 to 150 characters
        section_summaries.append(summary)

# Combine section summaries into an overall summary
overall_summary = " ".join(section_summaries)

# Print the overall summary
print("Overall Summary:")
print(overall_summary)




Overall Summary:
This will benefit students of our great state for years to come," he added. UT-Austin is among the Texas system's more affordable universities; tuition and fees cost about $10,500. Her mother is a single parent who has been working as a housekeeper and doing other odd jobs to make ends meet. She said she'll only need loans to pay for some of her living expenses. "


A second method to summarize the text: abstractive summarization with the BART model

In [None]:

from transformers import pipeline
import spacy

# Initialize the summarization pipeline with the BART model
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Initialize spaCy for segmenting the text
nlp = spacy.load("en_core_web_sm")

# Long article text
long_article = article

# Process the long article with spaCy; doc.sents yields sentences, so each "paragraph" below is a single sentence
doc = nlp(long_article)
paragraphs = [str(para) for para in doc.sents]

# Initialize variables for the combined summary and lengths
combined_summary = ""
original_length = len(long_article)
summarized_length = 0

# Summarize each sentence and combine the summaries
for paragraph in paragraphs:
    # Use the pipeline's default max_length and min_length; do_sample=True makes the output vary between runs
    summary = summarizer(paragraph, do_sample=True)[0]["summary_text"]
    summarized_length += len(summary)
    combined_summary += summary + "\n"


# Print the length of the original and summarized articles
print(f"Original Article Length: {original_length} characters")
print(f"Summarized Article Length: {summarized_length} characters")

# Print the combined summary
print("\nCombined Summary:")
print(combined_summary)


Your max_length is set to 142, but your input_length is only 34. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=17)
Your max_length is set to 142, but your input_length is only 42. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=21)
Your max_length is set to 142, but your input_length is only 42. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=21)
Your max_length is set to 142, but your input_length is only 69. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=34)
Your

Original Article Length: 4594 characters
Summarized Article Length: 9591 characters

Combined Summary:
Four year colleges and universities have difficulty recruiting talented students from the lower end of the economic spectrum who can't afford to attend such institutions without taking on massive debt. Four year colleges have difficulty recruit talented students who can’t afford to attends such institutions. Four-year colleges and University of California, Santa Cruz, has difficulty recruiting talent.
The University of Texas-Austin is offering full tuition scholarships to in-state undergraduates. The scholarships are available to students whose families make $65,000 or less per year. The university will offer the scholarships to students in Texas and Texas-Louisiana. Click here for more information on the scholarships.
The University of Texas System Board of Regents voted unanimously on Tuesday to establish a $160 million endowment. The endowment will be drawn from the state's Permane
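
The warnings and the lengths above (9591 summarized characters against 4594 original) suggest that single sentences are too short as inputs for this model. One possible remedy is to pack several sentences into each chunk before summarizing. The sketch below is an assumption, not the notebook's method: `chunk_sentences` is a hypothetical helper, and it uses a naive regular expression in place of spaCy so that it runs on its own.

```python
import re


def chunk_sentences(text, max_chars=1000):
    """Greedily pack whole sentences into chunks of roughly max_chars characters.

    A single sentence longer than max_chars is kept whole in its own chunk.
    """
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget, so start a new chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "One short sentence. Another short sentence. A third one here. And a fourth."
chunks = chunk_sentences(sample, max_chars=45)
for chunk in chunks:
    print(chunk)
```

Each chunk could then be passed to `summarizer` in place of a single sentence, with `max_length` lowered for short inputs as the warning itself suggests.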




##**READ THE SUMMARIZED NEWS (Speak)**



In [None]:
!pip install gTTS



Here we convert the summarized text to speech with gTTS and save the result as an audio file that reads the summary aloud.

In [None]:

from gtts import gTTS
import os

# Text you want to convert to speech
text = combined_summary

# Create a gTTS object
tts = gTTS(text, lang='en')

# Save the audio file (optional)
tts.save("output.mp3")

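# Play the audio with the default player. `start` exists only on Windows;
# on Linux (for example Colab) the shell cannot find it and os.system returns
# a non-zero status such as the 32512 shown below.
os.system("start output.mp3")  # On Windows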



32512
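
The value 32512 is what `os.system` returns on Linux when the shell cannot find the command (exit status 127). A platform-aware launcher is one way around this. The sketch below is only an illustration: `player_command` is a hypothetical helper, and the Linux branch assumes a desktop with `xdg-open` installed.

```python
import platform


def player_command(path, system=None):
    """Return an argument list that opens path with the default application."""
    system = system or platform.system()
    if system == "Windows":
        # `start` is a cmd.exe built-in; the empty string is the window title.
        return ["cmd", "/c", "start", "", path]
    if system == "Darwin":
        return ["open", path]
    # Assumes a Linux desktop with xdg-utils installed.
    return ["xdg-open", path]


print(player_command("output.mp3", "Linux"))
print(player_command("output.mp3", "Windows"))
```

The returned list can be handed to `subprocess.run`. In a hosted notebook such as Colab there is no desktop player, so displaying `IPython.display.Audio("output.mp3")` in a cell is the more likely fit there.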