<a href="https://colab.research.google.com/github/yashashridalvi/text_summarization_and_translation/blob/main/language_translation(project).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Text summarization, language detection, language translation, and reading news text aloud**



News Article Summarization: The process of transforming long news articles into brief, comprehensible summaries. It helps readers quickly grasp the key information without reading the entire article.

In [None]:
# First, import the necessary Python libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from wordcloud import WordCloud
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Text classification in deep learning
# first import some libraries
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding,Dense,Flatten,SimpleRNN, GRU

In [None]:
#Preprocessing
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing import sequence

Beautiful Soup simplifies web scraping by parsing HTML and XML, making it easy to extract and manipulate data from web pages in Python programs.

In [None]:
import requests
from bs4 import BeautifulSoup

In [None]:
# Retrieve page text
url = 'https://www.npr.org/2019/07/10/740387601/university-of-texas-austin-promises-free-tuition-for-low-income-students-in-2020'
news_page = requests.get(url).text

In [None]:
# Turn page into BeautifulSoup object to access HTML tags
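# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(news_page, 'html.parser')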

In [None]:

# Get headline
headline = soup.find('h1').get_text()
print(headline)

University of Texas-Austin Promises Free Tuition For Low-Income Students In 2020


***The main text of the article is wrapped in "p" tags, one per paragraph, so this time we need to find every "p" tag on the page.***
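To make the idea of collecting every "p" tag concrete, here is a minimal sketch that uses only the standard library's `html.parser` instead of Beautiful Soup. It is an illustrative stand-in, not how Beautiful Soup works internally. The `ParagraphCollector` class and the `sample_html` string are hypothetical and do not come from the NPR page.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the stripped text found inside each <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._inside = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._inside = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._inside:
            self.paragraphs.append("".join(self._buffer).strip())
            self._inside = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> is still part of the paragraph.
        if self._inside:
            self._buffer.append(data)


# Made-up HTML for illustration only.
sample_html = (
    "<h1>Headline</h1><p> First paragraph. </p>"
    "<div><p>Second <b>bold</b> one.</p></div>"
)
collector = ParagraphCollector()
collector.feed(sample_html)
collector.close()
paragraph_texts = collector.paragraphs
print(paragraph_texts)
```

The headline is skipped because it sits in an "h1" tag, while both paragraphs are returned with surrounding whitespace removed.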


In [None]:
# Get text from all <p> tags.
p_tags = soup.find_all('p')
# Get the text from each of the "p" tags and strip surrounding whitespace.
p_tags_text = [tag.get_text().strip() for tag in p_tags]
p_tags_text

['From',
 'By',
 'Vanessa Romo',
 ',',
 'Claire McInerny',
 'The University of Texas-Austin announced Tuesday it is offering full tuition scholarships to in-state undergraduates whose families make $65,000 or less a year.\n                \n                    \n                    Jon Herskovitz/Reuters\n                    \n                \nhide caption',
 "Four year colleges and universities have difficulty recruiting talented students from the lower end of the economic spectrum who can't afford to attend such institutions without taking on massive debt. To remedy that — at least in part — the University of Texas-Austin announced it is offering full tuition scholarships to in-state undergraduates whose families make $65,000 or less per year.",
 "The University of Texas System Board of Regents voted unanimously on Tuesday to establish a $160 million endowment, drawing from the state's Permanent University Fund to begin the program in the fall of 2020.",
 '"Recognizing both the need

In [None]:
# Keep only the paragraphs that contain a period and no newline characters ('\n').
sentence_list = [sentence for sentence in p_tags_text if '\n' not in sentence]
sentence_list = [sentence for sentence in sentence_list if '.' in sentence]
sentence_list


["Four year colleges and universities have difficulty recruiting talented students from the lower end of the economic spectrum who can't afford to attend such institutions without taking on massive debt. To remedy that — at least in part — the University of Texas-Austin announced it is offering full tuition scholarships to in-state undergraduates whose families make $65,000 or less per year.",
 "The University of Texas System Board of Regents voted unanimously on Tuesday to establish a $160 million endowment, drawing from the state's Permanent University Fund to begin the program in the fall of 2020.",
 '"Recognizing both the need for improved access to higher education and the high value of a UT Austin degree, we are dedicating a distribution from the Permanent University Fund to establish an endowment that will directly benefit students and make their degrees more affordable," Chairman of the Board of Regents Kevin Eltife said after the vote.',
 '"This will benefit students of our gr

In [None]:
# Combine list items into string.
article = ' '.join(sentence_list)
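The two filters and the join can be checked on a small hand-made list before running them on a real page. This is a minimal sketch: `build_article` and `sample_paragraphs` are hypothetical names, and the sample strings are made up to mimic a byline fragment, a photo caption, and two body paragraphs.

```python
def build_article(paragraph_texts):
    """Keep paragraphs that contain a period and no newline, then join them."""
    kept = [text for text in paragraph_texts if '\n' not in text and '.' in text]
    return ' '.join(kept)


sample_paragraphs = [
    'By',                                 # byline fragment: no period
    'A photo caption.\n   hide caption',  # caption: contains a newline
    'First sentence of the story.',
    'Second sentence of the story.',
]
sample_article = build_article(sample_paragraphs)
print(sample_article)
```

Only the two body paragraphs survive, which is the behavior the filter relies on to drop bylines and captions.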

##**Now we translate the news article**

In [None]:
# langdetect is used to detect the language of a text
!pip install langdetect



In [None]:
from langdetect import detect_langs
sentence = "Good morning yashashri"  # The sentence is in English, so the expected output is 'en'
print(detect_langs(sentence))

[en:0.9999987474672236]


In [None]:
# googletrans is used to translate text from one language to another
#!pip install googletrans==4.0.0-rc1


In [None]:
# googletrans is an unofficial Python client for the Google Translate web service
#!pip install googletrans

In [None]:
# List the supported languages and the codes used to select them
# Example: Hindi   --> 'hi'
#          English --> 'en'
import googletrans
print(googletrans.LANGUAGES)

{'af': 'afrikaans', 'sq': 'albanian', 'am': 'amharic', 'ar': 'arabic', 'hy': 'armenian', 'az': 'azerbaijani', 'eu': 'basque', 'be': 'belarusian', 'bn': 'bengali', 'bs': 'bosnian', 'bg': 'bulgarian', 'ca': 'catalan', 'ceb': 'cebuano', 'ny': 'chichewa', 'zh-cn': 'chinese (simplified)', 'zh-tw': 'chinese (traditional)', 'co': 'corsican', 'hr': 'croatian', 'cs': 'czech', 'da': 'danish', 'nl': 'dutch', 'en': 'english', 'eo': 'esperanto', 'et': 'estonian', 'tl': 'filipino', 'fi': 'finnish', 'fr': 'french', 'fy': 'frisian', 'gl': 'galician', 'ka': 'georgian', 'de': 'german', 'el': 'greek', 'gu': 'gujarati', 'ht': 'haitian creole', 'ha': 'hausa', 'haw': 'hawaiian', 'iw': 'hebrew', 'he': 'hebrew', 'hi': 'hindi', 'hmn': 'hmong', 'hu': 'hungarian', 'is': 'icelandic', 'ig': 'igbo', 'id': 'indonesian', 'ga': 'irish', 'it': 'italian', 'ja': 'japanese', 'jw': 'javanese', 'kn': 'kannada', 'kk': 'kazakh', 'km': 'khmer', 'ko': 'korean', 'ku': 'kurdish (kurmanji)', 'ky': 'kyrgyz', 'lo': 'lao', 'la': 'lat

In [None]:
from googletrans import Translator

In [None]:
translator = Translator()
#sentence ="hello good morning yashu"
#translate_to_hindi = translator.translate(sentence,dest ='hi')
#print(translate_to_hindi)

In [None]:
# Function to translate a sentence
def translate_sentence(article, target_language):
    # Initialize the Translator
    translator = Translator()

    # Translate the sentence to the target language
    translated_sentence = translator.translate(article, dest=target_language).text

    return translated_sentence

# Article text to be translated
article = ' '.join(sentence_list)

# Prompt the user for the target language code (e.g., 'es' for Spanish)
target_language = input("Enter the target language (e.g., 'es' for Spanish): ")
# Translate the article
translated_sentence = translate_sentence(article, target_language)

# Print the translated sentence
print(f"Original Sentence: {article}")
print(f"Translated Sentence: {translated_sentence}")


Enter the target language (e.g., 'es' for Spanish): mr
Original Sentence: Four year colleges and universities have difficulty recruiting talented students from the lower end of the economic spectrum who can't afford to attend such institutions without taking on massive debt. To remedy that — at least in part — the University of Texas-Austin announced it is offering full tuition scholarships to in-state undergraduates whose families make $65,000 or less per year. The University of Texas System Board of Regents voted unanimously on Tuesday to establish a $160 million endowment, drawing from the state's Permanent University Fund to begin the program in the fall of 2020. "Recognizing both the need for improved access to higher education and the high value of a UT Austin degree, we are dedicating a distribution from the Permanent University Fund to establish an endowment that will directly benefit students and make their degrees more affordable," Chairman of the Board of Regents Kevin Eltif
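The whole article is sent to the translator as one string, which can be long. If the translation service limits the size of a single request (the exact limit for googletrans is an assumption here, not something shown above), one option is to group sentences into chunks and translate them one chunk at a time. A minimal sketch follows. `chunk_sentences` and `max_chars` are hypothetical names.

```python
def chunk_sentences(sentences, max_chars=1000):
    """Group sentences into space-joined chunks of at most max_chars characters.

    A single sentence longer than max_chars becomes its own chunk.
    """
    chunks, current = [], ''
    for sentence in sentences:
        candidate = f'{current} {sentence}' if current else sentence
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow the chunk, so start a new one.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


demo_chunks = chunk_sentences(['aaaa.', 'bbbb.', 'cccc.'], max_chars=11)
print(demo_chunks)
```

Each chunk could then be passed to `translate_sentence(chunk, target_language)` and the translated pieces joined with spaces.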

#**Now to Summarize the News Article**

In [None]:
print(translated_sentence)

चार वर्षांच्या महाविद्यालये आणि विद्यापीठांना आर्थिक स्पेक्ट्रमच्या खालच्या टोकापासून प्रतिभावान विद्यार्थ्यांची भरती करण्यात अडचण आहे ज्यांना मोठ्या प्रमाणात कर्ज न घेता अशा संस्थांना उपस्थित राहणे परवडत नाही.यावर उपाय म्हणून-कमीतकमी काही प्रमाणात-टेक्सास-ऑस्टिन विद्यापीठाने घोषित केले की ते राज्यातील पदवीधरांना संपूर्ण शिकवणी शिष्यवृत्ती देत आहेत ज्यांची कुटुंबे दर वर्षी, 000 65,000 किंवा त्यापेक्षा कमी करतात.२०२० च्या शरद .तूतील कार्यक्रम सुरू करण्यासाठी राज्याच्या कायम विद्यापीठाच्या निधीतून १ $ ० दशलक्ष डॉलर्सची देणगी स्थापित करण्यासाठी टेक्सास सिस्टम बोर्ड ऑफ रीजेन्ट्सने मंगळवारी एकमताने मतदान केले.यूटी ऑस्टिन पदवीचे मूल्य, आम्ही कायमस्वरुपी विद्यापीठाच्या निधीतून वितरण समर्पित करीत आहोत ज्यामुळे विद्यार्थ्यांना थेट फायदा होईल आणि त्यांचे पदवी अधिक परवडतील, ”असे मतदानानंतर सांगितले.ते म्हणाले, “यामुळे आमच्या महान राज्यातील विद्यार्थ्यांना पुढील वर्षानुवर्षे फायदा होईल,” ते पुढे म्हणाले.एन्डॉवमेंट-ज्यात वेस्ट टेक्सासमधील राज्य-मालकीच्या जमिनीवर कमावलेल्या तेल आणि गॅस रॉयल्टीच्या प

In [None]:
#!pip install bert-extractive-summarizer transformers


In [None]:
from transformers import pipeline
import spacy

# Initialize the summarization pipeline with the BART model
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Initialize spaCy for segmenting the text
nlp = spacy.load("en_core_web_sm")

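# Long article text.
# Note: facebook/bart-large-cnn and en_core_web_sm are English models, but
# translated_sentence is in the target language chosen earlier ('mr' in the run
# shown here), so the English `article` is likely the more suitable input.
long_article = translated_sentence

# Process the article with spaCy to split it into sentences
# (doc.sents yields sentences, not paragraphs)
doc = nlp(long_article)
paragraphs = [str(para) for para in doc.sents]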

# Initialize variables for the combined summary and lengths
combined_summary = ""
original_length = len(long_article)
summarized_length = 0

# Summarize each sentence and combine the summaries
for paragraph in paragraphs:
    # Summarize the sentence (min_length=50 can exceed the length of a short input)
    summary = summarizer(paragraph, max_length=150, min_length=50, do_sample=False)[0]["summary_text"]
    summarized_length += len(summary)
    combined_summary += summary + "\n"

# Print the length of the original and summarized articles
print(f"Original Article Length: {original_length} characters")
print(f"Summarized Article Length: {summarized_length} characters")

# Print the combined summary
print("\nCombined Summary:")
print(combined_summary)


Your max_length is set to 150, but your input_length is only 65. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=32)
Your max_length is set to 150, but your input_length is only 64. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=32)
Your max_length is set to 150, but your input_length is only 23. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=11)
Your max_length is set to 150, but your input_length is only 10. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=5)
Your 

Original Article Length: 4619 characters
Summarized Article Length: 6570 characters

Combined Summary:
Cambodia is one of the oldest languages in the world. It means 'cambodian' in Hindi and means "citizen" in English. It is also known as 'camelopard' and 'crocodile' in Spanish.
‘’’ ‘ ’  “’, “”, ’” and  ”. ‘'’ and ‘ ”’. ’.
“”” “ ”  ‘’ ‘”, 000 65,000’,” ”. ”, ”’’. “ ”  “, “.   “,  ”,. “   “.” , “…” . “I’m sorry, I can’t believe I’ve got to say this, but I have to say it.’
The price of $ is $1,000,000. $ is the amount of money spent on a single dollar bill. The price of a dollar is the sum of all the money that is spent on each dollar. The cost of $1.00 is the price of one dollar.
The world is a big place. It is also a very small place. We are all in it together. We're all in this together. It's time to get to know each other. We'll be here for you all the time.
‘’’ ‘”’ means ‘I’m not going to tell you what to do’,’ says the author. “”. ”” “It means I’ll tell you how to do it,” he adds. 
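The warnings above appear because `max_length=150` is larger than the token count of each short input. In addition, `min_length=50` asks for more tokens than many of the inputs contain, which likely contributes to the summary (6570 characters) being longer than the original (4619 characters). One option is to derive both bounds from the input length. The sketch below is a minimal illustration: `length_bounds`, `ratio`, and `floor` are hypothetical, and the input token count would have to come from the model's tokenizer.

```python
def length_bounds(input_tokens, ratio=0.5, floor=5):
    """Return (min_length, max_length) scaled to the input's token count."""
    max_length = max(floor, int(input_tokens * ratio))
    min_length = max(1, max_length // 2)
    return min_length, max_length


# Token counts taken from the warnings above.
for n in (65, 23, 10):
    print(n, length_bounds(n))
```

With the default `ratio`, the resulting `max_length` values (32, 11, 5) match the ones suggested in the warnings. The pair could be passed to the pipeline as `summarizer(text, min_length=lo, max_length=hi, do_sample=False)`.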

Alternatively, a second method to summarize the text uses the pipeline's default length limits and enables sampling:

In [None]:

from transformers import pipeline
import spacy

# Initialize the summarization pipeline with the BART model
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Initialize spaCy for segmenting the text
nlp = spacy.load("en_core_web_sm")

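# Long article text (translated_sentence is in the target language,
# while both models loaded above are English-only)
long_article = translated_sentence

# Process the long article with spaCy to split it into sentences
# (doc.sents yields sentences, not paragraphs)
doc = nlp(long_article)
paragraphs = [str(para) for para in doc.sents]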

# Initialize variables for the combined summary and lengths
combined_summary = ""
original_length = len(long_article)
summarized_length = 0

# Summarize each sentence and combine the summaries
for paragraph in paragraphs:
    # Summarize the sentence with the pipeline's default max_length and min_length, with sampling enabled
    summary = summarizer(paragraph, do_sample=True)[0]["summary_text"]
    summarized_length += len(summary)
    combined_summary += summary + "\n"

# Print the length of the original and summarized articles
print(f"Original Article Length: {original_length} characters")
print(f"Summarized Article Length: {summarized_length} characters")

# Print the combined summary
print("\nCombined Summary:")
print(combined_summary)


Your max_length is set to 142, but your input_length is only 65. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=32)
Your max_length is set to 142, but your input_length is only 64. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=32)
Your max_length is set to 142, but your input_length is only 23. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=11)
Your max_length is set to 142, but your input_length is only 10. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=5)
Your 

KeyboardInterrupt: ignored

##**READ THE SUMMARIZED NEWS (Speak)**



In [None]:
!pip install gTTS



Here we convert the summarized text to speech with gTTS and save the result as an audio file that reads the content aloud.

In [None]:

from gtts import gTTS
import os

# Text you want to convert to speech
text = combined_summary

# Create a gTTS object
tts = gTTS(text, lang='en')

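# Save the audio to an MP3 file (playback below needs this file)
tts.save("output.mp3")

# Play the audio with the default player. "start" works only on Windows;
# in Colab or Jupyter, IPython.display.Audio(filename="output.mp3") can be used instead.
os.system("start output.mp3")  # On Windows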



gTTSError: ignored
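The `start` command used for playback exists only on Windows. A minimal sketch of choosing an opener per platform follows. `playback_command` is a hypothetical helper that only builds the command and does not run it. It assumes a desktop environment, so it does not apply to Colab, where an in-notebook audio widget would be needed instead.

```python
import platform


def playback_command(path, system=None):
    """Return the command list that opens `path` with the default application."""
    system = system or platform.system()
    if system == "Windows":
        # "start" is a cmd.exe builtin; the empty string is the window title.
        return ["cmd", "/c", "start", "", path]
    if system == "Darwin":
        return ["open", path]
    # Assumption: other systems are Linux desktops that provide xdg-open.
    return ["xdg-open", path]


print(playback_command("output.mp3", system="Darwin"))
print(playback_command("output.mp3", system="Linux"))
```

The returned list could be run with `subprocess.run(playback_command("output.mp3"), check=False)`.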