<a href = "https://www.pieriantraining.com"><img src="../PT Centered Purple.png"> </a>

<em style="text-align:center">Copyrighted by Pierian Training</em>

## Translation and Summary of Foreign News

The goal of this project is to summarize news from foreign countries by automatically scraping a newspaper's headlines or articles, then using OpenAI to translate the text and summarize it in English.

### Data (Web Scraping)

Let's explore how we could set up a scraping job. Note that this approach has its limits: not every website can be easily scraped, and your results may vary on other newspaper sites.

In [1]:
import requests
import bs4

In [2]:
# This is just a limited example dictionary
# As each website is different, this gets harder with more countries!
country_newspapers = {"Spain":'https://elpais.com/', 
                       "France":"https://www.lemonde.fr/"
                     }

In [3]:
# Note: the answer must match a dictionary key exactly (e.g. "France")
country = input("What country are you interested in for a news summary? ")

What country are you interested in for a news summary? France


In [4]:
url = country_newspapers[country]

In [5]:
url

'https://www.lemonde.fr/'

In [6]:
result = requests.get(url)
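Since not every site can be scraped reliably, it may help to check that the request actually succeeded before parsing. The following is a minimal sketch, not part of the original notebook: `fetch_html` and its `http_get` parameter are hypothetical names, and the getter is passed in (for example `requests.get`) so the function can be tried without a network connection.

```python
def fetch_html(url, http_get, timeout=10.0):
    """Return the page HTML, or None if the request fails or returns an error status.

    `http_get` is any callable shaped like `requests.get` (an assumption made
    for this sketch so the function can be exercised offline).
    """
    try:
        response = http_get(url, timeout=timeout)
    except Exception as exc:  # e.g. connection errors or timeouts
        print(f"Request to {url} failed: {exc}")
        return None
    if not response.ok:  # `ok` is False for 4xx and 5xx status codes
        print(f"{url} returned HTTP {response.status_code}")
        return None
    return response.text
```

With `requests` imported as above, a call could look like `html = fetch_html(url, requests.get)`, followed by a check for `None` before handing the text to BeautifulSoup.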

### Parsing HTML 

In [7]:
soup = bs4.BeautifulSoup(result.text,"lxml")

In [8]:
# France Le Monde
soup.select('.article__title-label')

[<p class="article__title-label">Trois morts et six blessés dans une attaque, selon la police israélienne</p>]

Clearly, the title selector won't be the same for Spain's El País. Let's quickly test that, then combine the selectors with the URLs in the dictionary so that all the information lives in one place.

In [9]:
spain_results = requests.get('https://elpais.com/')
soup = bs4.BeautifulSoup(spain_results.text,"lxml")

In [10]:
len(soup.select('.c_t'))

150

In [11]:
soup.select('.c_t')[:3]

[<h2 class="c_t"><a href="https://elpais.com/espana/2023-11-30/psoe-y-junts-pactan-ocultar-el-nombre-del-verificador-para-evitar-presiones.html"><span class="c_t_i c_t_i-s _pr" name="elpais_ico"></span>El PSOE y Junts pactan ocultar la identidad del verificador para evitar presiones</a></h2>,
 <h2 class="c_t"><a href="https://elpais.com/espana/2023-11-30/ultimas-noticias-de-la-actualidad-politica-en-directo.html">Sánchez: “La amnistía no era el paso que quería dar, pero es coherente con la normalización en Cataluña”</a></h2>,
 <h2 class="c_t"><a href="https://elpais.com/espana/catalunya/2023-11-30/el-pp-catalan-debe-superar-las-tutelas-y-tomar-sus-decisiones-de-manera-mas-libre.html"><span class="c_t_i c_t_i-s _pr" name="elpais_ico"></span>Alejandro Fernández: “El PP catalán debe tomar las decisiones de forma más libre”</a></h2>]

#### Grabbing Text

Now we just need to figure out how to grab the text of these headlines:

In [12]:
for tag in soup.select('.c_t')[:3]:
    print(tag.getText())

El PSOE y Junts pactan ocultar la identidad del verificador para evitar presiones
Sánchez: “La amnistía no era el paso que quería dar, pero es coherente con la normalización en Cataluña”
Alejandro Fernández: “El PP catalán debe tomar las decisiones de forma más libre”
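Scraped text can carry stray whitespace, and a selector that matches many elements may return the same headline more than once. As a rough sketch (the helper name `clean_headlines` and its `limit` parameter are assumptions, not part of the original notebook), the extracted strings could be tidied up before being sent anywhere:

```python
def clean_headlines(raw_texts, limit=3):
    """Normalize whitespace, drop empty strings and duplicates, keep the first `limit`."""
    seen = set()
    cleaned = []
    for text in raw_texts:
        if len(cleaned) >= limit:
            break
        headline = " ".join(text.split())  # collapse newlines and repeated spaces
        if headline and headline not in seen:
            seen.add(headline)
            cleaned.append(headline)
    return cleaned
```

It could be fed from the loop above, for example `clean_headlines(tag.getText() for tag in soup.select('.c_t'))`.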


### Combining the URL and Title Selector in the Dictionary

In [13]:
# "Country": (URL, CSS selector for the headline elements)
country_newspapers = {"Spain":('https://elpais.com/','.c_t'), 
                       "France":("https://www.lemonde.fr/",'.article__title-label')
                     }
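Because the dictionary is indexed with whatever the user types, an answer such as `france` or ` Spain ` would not match a key. One possible way to make the lookup more forgiving is sketched below; `lookup_newspaper` is a hypothetical helper, and the dictionary is repeated only to keep the example self-contained.

```python
# Same mapping as above: "Country": (URL, CSS selector for headline elements)
country_newspapers = {
    "Spain": ("https://elpais.com/", ".c_t"),
    "France": ("https://www.lemonde.fr/", ".article__title-label"),
}

def lookup_newspaper(country, newspapers=country_newspapers):
    """Return (url, selector) for a country, ignoring case and surrounding spaces.

    Returns None when the country is not in the mapping.
    """
    wanted = country.strip().casefold()
    for name, entry in newspapers.items():
        if name.casefold() == wanted:
            return entry
    return None
```

Returning `None` for an unknown country leaves it to the caller to decide how to report the problem.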

## Translation via OpenAI

In [14]:
def create_system_prompt():
    return "Detect the language of the news headlines below, then translate a summary of the headlines to English in a conversational tone."

In [15]:
def create_prompt():
    # Get Country
    country = input("What country would you like a news summary for? ")
    # Get country's URL newspaper and the HTML Tag for titles
    try:
        url,tag = country_newspapers[country]
    except KeyError:
        # Only a missing key means "unsupported"; other errors should surface
        print("Sorry that country is not supported!")
        return
    
    # Scrape the Website
    results = requests.get(url)
    soup = bs4.BeautifulSoup(results.text,"lxml")
    
    # Grab all the text
    country_headlines = ''
    for item in soup.select(tag)[:3]:
        country_headlines += item.getText()+'\n'
        
    return country_headlines

In [16]:
prompt = create_prompt()

What country would you like a news summary for? Spain


In [17]:
print(prompt)

El PSOE y Junts pactan ocultar la identidad del verificador para evitar presiones
Sánchez: “La amnistía no era el paso que quería dar, pero es coherente con la normalización en Cataluña”
Alejandro Fernández: “El PP catalán debe tomar las decisiones de forma más libre”



In [18]:
from openai import OpenAI
client = OpenAI()

In [19]:
response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": create_system_prompt()},
                {"role": "user", "content": prompt},
            ],
            temperature=0.1, # Low temperature keeps the output focused and repeatable, optional
            top_p=1.0,
            frequency_penalty=0.0,
            presence_penalty=0.0,
            max_tokens=200,
)
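The request above can also be wrapped in a small function that returns only the generated text. This is a sketch under assumptions: `summarize_headlines` is a hypothetical name, the client is passed in as a parameter, and the guard for an empty prompt covers the case where `create_prompt()` returned nothing because the country was unsupported.

```python
def summarize_headlines(client, system_prompt, headlines, model="gpt-3.5-turbo"):
    """Send the headlines to the chat completions endpoint and return the reply text.

    Returns None when there are no headlines to send.
    """
    if not headlines or not headlines.strip():
        return None
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": headlines},
        ],
        temperature=0.1,
        max_tokens=200,
    )
    # .content holds just the generated text, without the message wrapper
    return response.choices[0].message.content
```

With the objects defined in this notebook, a call could look like `summarize_headlines(client, create_system_prompt(), prompt)`.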


In [21]:
print(response.choices[0].message)

ChatCompletionMessage(content='The first headline is in Spanish. It says "PSOE and Junts agree to hide the identity of the verifier to avoid pressure." This refers to a political agreement between the Spanish Socialist Workers\' Party (PSOE) and Junts, a Catalan political party, to keep the identity of the verifier of the Catalan independence process confidential in order to prevent external pressures.\n\nThe second headline is also in Spanish. It says "Sánchez: \'Amnesty was not the step I wanted to take, but it is consistent with normalization in Catalonia\'." This is a statement made by Pedro Sánchez, the leader of PSOE and the Prime Minister of Spain, regarding the controversial topic of amnesty for Catalan politicians involved in the independence movement. Sánchez explains that while amnesty was not his preferred approach, it aligns with the goal of normalizing the situation in Catalonia.\n\nThe third headline is also in Spanish. It says "Alejandro Fernández: \'Catalan PP should m