# Set up Gemini API

In [1]:
import os

In [2]:
os.chdir("../4_Data")

In [12]:
env = {}
with open(".env", "r") as env_file:
    for line in env_file:
        if "=" in line and not line.startswith("#"):  # Ignore comments and invalid lines
            key, value = line.strip().split("=", 1)
            env[key] = value

print(env.keys())

dict_keys(['NYT_KEY', 'NYT_SECRET', 'GOOGLE_API_KEY'])
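
The loop above keeps any quotes around a value and any spaces around the `=`. A minimal sketch of a slightly more forgiving parser, assuming the file only contains plain `KEY=VALUE` lines; the helper name `parse_env` and the sample values are made up for illustration:

```python
def parse_env(text):
    """Parse KEY=VALUE lines into a dict (hypothetical helper, not part of any library)."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and lines without an assignment
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        env[key.strip()] = value
    return env


# Made-up file content, not real keys
sample = '# comment\nGOOGLE_API_KEY = "abc123"\nNYT_KEY=xyz\n'
print(parse_env(sample))
```

To use it on the real file, pass the file content: `env = parse_env(open(".env").read())`.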


In [14]:
from google import genai

client = genai.Client(api_key=env["GOOGLE_API_KEY"]) # store your API key in the secrets or replace here

response = client.models.generate_content(
    model="gemini-2.0-flash", # gemini-2.5-flash-preview-05-20
    contents="Explain how AI works in a few words"
)

print(response.text)


AI learns from data to make predictions or decisions.



Check the documentation here: https://pypi.org/project/google-genai/

Quick in-class exercise to set up a Gemini API key?

# Load some data

In [15]:
import pandas as pd

lead_paragraphs = pd.read_csv("http://farys.org/daten/nyt_chatgpt.csv")
full_articles = pd.read_csv("http://farys.org/daten/nyt_chatgpt_fulltexts.csv")


In [16]:
lead_paragraphs.head()

Unnamed: 0,abstract,web_url,pub_date,document_type
0,Artificial intelligence is confronting white-c...,https://www.nytimes.com/2023/03/28/business/ec...,2023-03-28T09:00:25Z,article
1,The company unveiled new technology called GPT...,https://www.nytimes.com/2023/03/14/technology/...,2023-03-14T17:04:15Z,article
2,Journalists may have a new semi-reliable source.,https://www.nytimes.com/2023/04/21/opinion/cha...,2023-04-21T09:00:19Z,article
3,With the rise of the popular new chatbot ChatG...,https://www.nytimes.com/2023/01/16/technology/...,2023-01-16T10:00:26Z,article
4,The action by Italy’s data protection agency i...,https://www.nytimes.com/2023/03/31/technology/...,2023-03-31T16:02:02Z,article


**Some minor cleaning**

In [17]:
import re

filters = [
    "Every Tuesday and Friday",
    "Want to get this newsletter in your inbox",
    "To the Editor:"
]

pattern = "|".join(re.escape(f) for f in filters)  # escape so the filters are matched literally

lead_paragraphs = lead_paragraphs[
    (lead_paragraphs["document_type"] == "article") &
    (~lead_paragraphs["abstract"].str.contains(pattern, na=False))
]


**Narrow down to 50**

In [18]:
lead_paragraphs = lead_paragraphs.head(50)

**Prepare one-shot example prompt**

In [19]:
example_text = full_articles["fulltext"].iloc[1]

prompt = f"""Extract the 5 main topics of the following text. Then, categorize the text according to how strongly it is described by each topic. The output should look like this:

NameOfTopic1: 50%
NameOfTopic2: 20%
NameOfTopic3: 15%
NameOfTopic4: 15%
NameOfTopic5: 0%

Here is the text:
{example_text}
"""
print(prompt)


Extract the 5 main topics of the following text. Then, categorize the text according to how strongly it is described by each topic. The output should look like this:

NameOfTopic1: 50%
NameOfTopic2: 20%
NameOfTopic3: 15%
NameOfTopic4: 15%
NameOfTopic5: 0%

Here is the text:
Have you played with generative artificial intelligence like ChatGPT, for text, or DALL-E or Midjourney, for images? What have you used it for? What have you discovered?If you’re new to the topic, this glossary explains that generative A.I. is technology that creates content — including text, images, video and computer code — by identifying patterns in large quantities of training data, and then creating original material that has similar characteristics.In their article “35 Ways Real People Are Using A.I. Right Now,” Francesca Paris and Larry Buchanan explain how these tools are being used to help with everyday tasks:The public release of ChatGPT last fall kicked off a wave of interest in artificial intelligence. A

**Adjust the temperature to get consistent results.**

Illustration of the temperature concept: https://poloclub.github.io/transformer-explainer/
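
A small sketch of what temperature does to the next-token distribution: the raw scores are divided by the temperature before the softmax. The scores below are made up, and this is a generic illustration, not Gemini's internal implementation.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At a low temperature almost all probability sits on the top-scoring token, which is why repeated requests give nearly the same answer. A temperature of exactly 0 cannot be plugged into this formula (division by zero); it is usually understood as always picking the most likely token.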

In [24]:
from google.genai import types # to control temperature

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-05-20",
    contents=prompt,
    config=types.GenerateContentConfig(
        temperature=0.0 # Lower values give more deterministic output; for replicable results 0 is best
    )
)

In [25]:
print(response.text)

Here are the 5 main topics of the text and their categorization:

Generative AI Applications & Use Cases: 50%
Benefits & Impact of Generative AI: 20%
Ethical Considerations of Generative AI: 15%
Personal Experience & Engagement with Generative AI: 10%
Definition & Mechanics of Generative AI: 5%


Maybe it's better to derive a handful of topics from more than one text?
- could give the complete corpus to generate topics
- could ask for "topics around chatgpt/LLMs"
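
One way the two ideas above could be combined, as a sketch: a single prompt that shows several texts and asks for a shared set of topics with a given focus. The helper `build_topic_prompt` and its wording are assumptions, not a tested prompt; the two sample texts are lead paragraphs from the data.

```python
def build_topic_prompt(texts, n_topics=5, focus="ChatGPT and large language models"):
    """Build one prompt asking for shared topics across all texts (hypothetical helper)."""
    numbered = "\n".join(f"## Text {i + 1}\n[{text}]\n" for i, text in enumerate(texts))
    return (
        f"Below are {len(texts)} texts, each marked with '## Text [Number]'.\n\n"
        f"{numbered}\n"
        f"Derive {n_topics} topics around {focus} that together describe the whole collection. "
        "Return one short topic name per line."
    )


sample_texts = [
    "Journalists may have a new semi-reliable source.",
    "With the rise of the popular new chatbot ChatGPT, colleges are restructuring "
    "some courses and taking preventive measures.",
]
topic_prompt = build_topic_prompt(sample_texts)
print(topic_prompt)
```

The resulting string can be passed as `contents` to `client.models.generate_content`, in the same way as `prompt` above.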

**Batch processing** (50 texts at once)

In [26]:
mytexts_with_delimiters = "\n".join(
    f"## Paragraph {i+1}\n[{text}]\n" for i, text in enumerate(lead_paragraphs["abstract"])
)

batch_prompt = f"""
Below are lead paragraphs from New York Times articles. Each paragraph is marked with '## Paragraph [Number]'. Here are the paragraphs:

{mytexts_with_delimiters}

Based on these lead paragraphs, perform the following 3 tasks for each paragraph:

1) Categorize the paragraph by how strongly it is described by these 5 topics (total must be 100%):
1. Applications and Use Cases
2. Technological Advancements
3. Social and Ethical Implications
4. Business and Economic Impact
5. Public Perception and Cultural Influence

2) Is it about 'ChatGPT'? ('yes' or 'no')

3) Is the paragraph in favor of ChatGPT technology? (-1 = against, 0 = neutral, 1 = in favor)

Output the response as a JSON list:
{{
    "paragraph_id": ##,
    "topic_applications": xx,
    "topic_advancements": xx,
    "topic_social": xx,
    "topic_business": xx,
    "topic_cultural": xx,
    "topic_other": xx,
    "about_chatgpt": "yes/no",
    "in_favor": value_between_-1_and_1
}}

'xx' are percentages and must total 100%.
"""


In [27]:
print(batch_prompt)


Below are lead paragraphs from New York Times articles. Each paragraph is marked with '## Paragraph [Number]'. Here are the paragraphs:

## Paragraph 1
[Artificial intelligence is confronting white-collar professionals more directly than ever. It could make them more productive — or obsolete.]

## Paragraph 2
[The company unveiled new technology called GPT-4 four months after its ChatGPT stunned Silicon Valley. The update is an improvement, but it carries some of the same baggage.]

## Paragraph 3
[Journalists may have a new semi-reliable source.]

## Paragraph 4
[With the rise of the popular new chatbot ChatGPT, colleges are restructuring some courses and taking preventive measures.]

## Paragraph 5
[The action by Italy’s data protection agency is the first known instance of the chatbot’s being blocked by a government order.]

## Paragraph 6
[Powerful new artificial-intelligence software is already shaking up the travel industry, but it has a long way to go until it can plan a seamle

**Send to LLM**

In [28]:
response = client.models.generate_content(
    model="gemini-2.5-flash-preview-05-20",
    contents=batch_prompt,
    config=types.GenerateContentConfig(
        temperature=0.0,
        response_mime_type='application/json' # nice option to directly ask for clean json output
    )
)


In [30]:
print(response.text)

[
  {
    "paragraph_id": 1,
    "topic_applications": 20,
    "topic_advancements": 0,
    "topic_social": 40,
    "topic_business": 40,
    "topic_cultural": 0,
    "about_chatgpt": "no",
    "in_favor": 0
  },
  {
    "paragraph_id": 2,
    "topic_applications": 10,
    "topic_advancements": 60,
    "topic_social": 20,
    "topic_business": 10,
    "topic_cultural": 0,
    "about_chatgpt": "yes",
    "in_favor": 0
  },
  {
    "paragraph_id": 3,
    "topic_applications": 70,
    "topic_advancements": 0,
    "topic_social": 30,
    "topic_business": 0,
    "topic_cultural": 0,
    "about_chatgpt": "no",
    "in_favor": 0
  },
  {
    "paragraph_id": 4,
    "topic_applications": 10,
    "topic_advancements": 0,
    "topic_social": 70,
    "topic_business": 0,
    "topic_cultural": 20,
    "about_chatgpt": "yes",
    "in_favor": -1
  },
  {
    "paragraph_id": 5,
    "topic_applications": 0,
    "topic_advancements": 0,
    "topic_social": 80,
    "topic_business": 20,
    "topic_cultu

**Extract JSON and convert to dataframe**

In [33]:
import json

df = pd.DataFrame(json.loads(response.text))
df["originaltext"] = lead_paragraphs["abstract"].values

df.head(11)  # or use df.sample(10) for a random sample

# index 11 is interesting!

Unnamed: 0,paragraph_id,topic_applications,topic_advancements,topic_social,topic_business,topic_cultural,about_chatgpt,in_favor,originaltext
0,1,20,0,40,40,0,no,0,Artificial intelligence is confronting white-c...
1,2,10,60,20,10,0,yes,0,The company unveiled new technology called GPT...
2,3,70,0,30,0,0,no,0,Journalists may have a new semi-reliable source.
3,4,10,0,70,0,20,yes,-1,With the rise of the popular new chatbot ChatG...
4,5,0,0,80,20,0,no,-1,The action by Italy’s data protection agency i...
5,6,50,20,0,30,0,no,0,Powerful new artificial-intelligence software ...
6,7,70,30,0,0,0,no,1,Large language models are already good at a wi...
7,8,0,0,0,40,60,yes,1,The San Francisco artificial intelligence lab ...
8,9,0,20,30,50,0,no,-1,The promised “live” demonstration of the bot h...
9,10,0,0,0,30,70,yes,1,"Even inside the company, the chatbot’s popular..."
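
Before trusting the table, it is worth checking that the model followed the instructions. A minimal sketch, assuming the key names requested in the prompt; the helper `find_invalid_rows` is made up, the first test row is copied from the output above, and the second is deliberately broken:

```python
import json

TOPIC_KEYS = [
    "topic_applications", "topic_advancements", "topic_social",
    "topic_business", "topic_cultural",
]


def find_invalid_rows(rows, topic_keys=TOPIC_KEYS):
    """Return the ids of rows whose percentages do not total 100 or whose labels are off."""
    bad = []
    for row in rows:
        # topic_other is optional in the reply, so treat a missing value as 0
        total = sum(row.get(k, 0) for k in topic_keys) + row.get("topic_other", 0)
        ok = (
            total == 100
            and row.get("about_chatgpt") in ("yes", "no")
            and row.get("in_favor") in (-1, 0, 1)
        )
        if not ok:
            bad.append(row.get("paragraph_id"))
    return bad


raw = """[
  {"paragraph_id": 1, "topic_applications": 20, "topic_advancements": 0, "topic_social": 40,
   "topic_business": 40, "topic_cultural": 0, "about_chatgpt": "no", "in_favor": 0},
  {"paragraph_id": 2, "topic_applications": 10, "topic_advancements": 60, "topic_social": 20,
   "topic_business": 20, "topic_cultural": 0, "about_chatgpt": "maybe", "in_favor": 0}
]"""
print(find_invalid_rows(json.loads(raw)))  # prints [2]
```

On the real reply this would be `find_invalid_rows(json.loads(response.text))`.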


potential problems:

- are results consistent?
- does the LLM have enough information to classify?
- what about more than 50 articles? (context window size matters)
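
For more texts than fit into one request, the list can be split into batches and each batch sent separately. A minimal sketch; the helper `chunk` and the batch size of 50 are assumptions, and the placeholder strings stand in for the abstracts:

```python
def chunk(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


texts = [f"text {n}" for n in range(1, 121)]  # placeholder for 120 abstracts
batches = chunk(texts, 50)
print([len(b) for b in batches])  # prints [50, 50, 20]
```

Note that the paragraph numbering in the prompt would restart in every batch, so the batch offset has to be added back when the results are combined.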

**For the fulltexts:**

In [34]:
mytexts_with_delimiters = "\n".join(
    f"## Article {i+1}\n[{text}]\n" for i, text in enumerate(full_articles["fulltext"])
)

batch_prompt = f"""
Below are articles from the New York Times. Each article is marked with '## Article [Number]'. Here are the articles:

{mytexts_with_delimiters}

Based on these articles, perform the following 3 tasks for each article:

1) Categorize the article by how strongly it is described by these 5 topics (total must be 100%):
1. Applications and Use Cases
2. Technological Advancements
3. Social and Ethical Implications
4. Business and Economic Impact
5. Public Perception and Cultural Influence

2) Is it about 'ChatGPT'? ('yes' or 'no')

3) Is the article in favor of ChatGPT technology? (-1 = against, 0 = neutral, 1 = in favor)

Output the response as a JSON list:
{{
    "article_id": ##,
    "topic_applications": xx,
    "topic_advancements": xx,
    "topic_social": xx,
    "topic_business": xx,
    "topic_cultural": xx,
    "topic_other": xx,
    "about_chatgpt": "yes/no",
    "in_favor": value_between_-1_and_1
}}

'xx' are percentages and must total 100%.
"""


In [35]:
print(batch_prompt)


Below are articles from the New York Times. Each article is marked with '## Article [Number]'. Here are the articles:

## Article 1
[Times Insider explains who we are and what we do and delivers behind-the-scenes insights into how our journalism comes together.Prepare to wait awhile longer for the return of “Stranger Things,” “Abbott Elementary” and a number of other beloved TV shows.Last week, the Writers Guild of America, which represents about 11,500 writers of television and film, went on strike after contract negotiations with studios and streaming services failed. It is the W.G.A.’s first strike in 15 years; the last strike, which began in 2007, lasted 100 days.The impasse has brought many television productions to a standstill. Last week, picket lines formed in Los Angeles and New York; late-night talk shows went dark and their slots have since been filled with reruns. With no solution in sight, the fate of popular shows — and perhaps, the fall TV season — hangs in the balance.

In [36]:
response = client.models.generate_content(
    model="gemini-2.5-flash-preview-05-20",
    contents=batch_prompt,
    config=types.GenerateContentConfig(
        temperature=0.0,
        response_mime_type='application/json' # ask directly for clean JSON output
    )
)


In [37]:
print(response.text)

[
  {
    "article_id": 1,
    "topic_applications": 10,
    "topic_advancements": 10,
    "topic_social": 30,
    "topic_business": 40,
    "topic_cultural": 10,
    "topic_other": 0,
    "about_chatgpt": "yes",
    "in_favor": -1
  },
  {
    "article_id": 2,
    "topic_applications": 60,
    "topic_advancements": 10,
    "topic_social": 20,
    "topic_business": 0,
    "topic_cultural": 10,
    "topic_other": 0,
    "about_chatgpt": "yes",
    "in_favor": 1
  },
  {
    "article_id": 3,
    "topic_applications": 20,
    "topic_advancements": 30,
    "topic_social": 10,
    "topic_business": 35,
    "topic_cultural": 5,
    "topic_other": 0,
    "about_chatgpt": "yes",
    "in_favor": 0
  },
  {
    "article_id": 4,
    "topic_applications": 5,
    "topic_advancements": 5,
    "topic_social": 40,
    "topic_business": 0,
    "topic_cultural": 50,
    "topic_other": 0,
    "about_chatgpt": "yes",
    "in_favor": 0
  },
  {
    "article_id": 5,
    "topic_applications": 15,
    "topic_

**Extract JSON and convert to dataframe**

In [39]:
import json

df = pd.DataFrame(json.loads(response.text))

def truncate_text(s, length=50):
    return s if len(s) <= length else s[:length] + "..."

df["shorttext"] = full_articles["fulltext"].astype(str).apply(lambda x: truncate_text(x, 50))
#df["originaltext"] = full_articles["fulltext"].values

df

Unnamed: 0,article_id,topic_applications,topic_advancements,topic_social,topic_business,topic_cultural,topic_other,about_chatgpt,in_favor,shorttext
0,1,10,10,30,40,10,0,yes,-1,Times Insider explains who we are and what we ...
1,2,60,10,20,0,10,0,yes,1,Have you played with generative artificial int...
2,3,20,30,10,35,5,0,yes,0,When the San Francisco start-up OpenAI release...
3,4,5,5,40,0,50,0,yes,0,An article on Saturday about an increased visi...
4,5,15,10,40,25,10,0,yes,-1,Television loves a good sentient-machine story...
5,6,0,0,1,0,99,0,yes,0,I’ve made a decision that I feel very good abo...
6,7,15,5,10,65,5,0,yes,0,"Working through the bankruptcy of FTX, Sam Ban..."
7,8,10,5,15,65,5,0,yes,0,Who would want to be a chief executive?The pre...
8,9,5,5,30,30,30,0,yes,0,There is no escaping Elon Musk. Having thrust ...
9,10,10,5,10,70,5,0,yes,0,European antitrust regulators are set to weigh...


In [40]:
import textwrap
print(textwrap.fill(full_articles["fulltext"][4], width=80)) # negative article -> prediction seems pretty good

Television loves a good sentient-machine story, from “Battlestar Galactica” to
“Westworld” to “Mrs. Davis.” With the Writers Guild of America strike, that
premise has broken the fourth wall. The robots are here, and the humans are
racing to defend against them, or to ally with them.Among the many issues in the
strike is the union’s aim to “regulate use of material produced using artificial
intelligence or similar technologies,” at a time when the ability of chatbots to
auto-generate all manner of writing is growing exponentially.In essence, writers
are asking the studios for guardrails against being replaced by A.I., having
their work used to train A.I. or being hired to punch up A.I.-generated scripts
at a fraction of their former pay rates.The big-ticket items in the strike
involve, broadly, how the streaming model has disrupted the ways TV writers have
made a living. But it’s the A.I. question that has captured imaginations,
understandably so. Hollywood loves robot stories because t

In [41]:
print(textwrap.fill(full_articles["fulltext"][9], width=80)) # business topic -> also pretty good

European antitrust regulators are set to weigh in Monday on Microsoft’s $69
billion takeover of Activision Blizzard. In a twist, the European Commission is
reportedly set to approve a video game megadeal that its American and British
counterparts have already rejected.If that happens, tech giants will be left
with an even more confusing regulatory landscape to contend with, as three of
the world’s most powerful antitrust regulators take different policy tacks.The
E.U. is expected to be satisfied with Microsoft’s concessions on the deal,
namely pledges to make sure top titles like Call of Duty and World of Warcraft
are available to rival video game platforms like Sony’s and Nintendo’s.That
would be a very different conclusion than that reached last month by Britain’s
Competition and Markets Authority, which argued that Microsoft could end up
dominating the nascent business of cloud gaming — and that no solution apart
from selling off big chunks of Activision would be acceptable.It would

**Summary**

- pros of using LLMs:
  + can prompt/ask whatever is of interest
  + not a lot of cleaning required
- cons:
  + hallucinations (e.g. sentiments for rubbish texts)
  + results might vary between requests (temperature parameter!)
  + costs
  + bias
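
On costs: a rough back-of-the-envelope estimate before sending a large request. Everything here is an assumption: the helper `estimate_cost` is made up, roughly 4 characters per token is only a crude heuristic for English text, and the prices are placeholders, not actual Gemini prices (check the current price list).

```python
def estimate_cost(prompt, expected_output_tokens, price_in_per_million,
                  price_out_per_million, chars_per_token=4):
    """Rough cost estimate in the currency of the given prices (all inputs are assumptions)."""
    input_tokens = len(prompt) / chars_per_token  # crude heuristic, not a real tokenizer
    cost = (input_tokens * price_in_per_million
            + expected_output_tokens * price_out_per_million) / 1_000_000
    return input_tokens, cost


demo_prompt = "x" * 40_000  # stands in for a 40,000-character batch prompt
tokens, cost = estimate_cost(
    demo_prompt,
    expected_output_tokens=5_000,
    price_in_per_million=1.0,   # placeholder price per million input tokens
    price_out_per_million=2.0,  # placeholder price per million output tokens
)
print(round(tokens), round(cost, 4))
```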

**Where to go from here?**

- improve prompt
  + better parsing
  + more accurate instructions and delimiters
  + examples (few shot prompting)
  + time to think
  + ask for reasoning (why is it "not about ChatGPT", why the sentiment?)
  + calculate costs beforehand
- reflect about potential biases
  + in training data
  + in algorithms / finetuning / RLHF
  + in the prompt itself
- potential problems:
  + model being trusted too much
  + hallucinations
  + exposing sensitive data
- how to structure your code for larger requests
  + chunking into batches (pros/cons, e.g. calibration)
  + repeating requests for stability of results?
  + comparing models
  + sampling of results (better than human encoding or not?)
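
For the point about repeating requests: one simple stability measure is the share of runs that agree on the most common label for each text. A sketch with made-up `in_favor` labels; the helper `agreement_per_item` is hypothetical, and in practice each inner list would come from one repeated request.

```python
from collections import Counter


def agreement_per_item(runs):
    """For each item, the share of runs that returned the most common label."""
    n_runs = len(runs)
    shares = []
    for labels in zip(*runs):  # labels for one item across all runs
        most_common_count = Counter(labels).most_common(1)[0][1]
        shares.append(most_common_count / n_runs)
    return shares


# Made-up `in_favor` labels for four paragraphs from three repeated requests
runs = [
    [0, 0, -1, 1],
    [0, 0, -1, 0],
    [0, 1, -1, 1],
]
shares = agreement_per_item(runs)
print([round(s, 2) for s in shares])
```

Items with a low share are good candidates for a closer look or for human coding.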

## Appendix

**Example of highly structured prompt using XML**

In [42]:
# Format reviews as XML
review_texts = [...]  # your list of reviews
mycontext = "\n".join(f"<review>{review}</review>" for review in review_texts)

# Create the XML prompt
xmlprompt = f"""
<purpose>
Analyse the sentiment of Amazon reviews for a Tamagotchi.
</purpose>

<instructions>
<instruction>Below are Amazon reviews for a Tamagotchi. Provide a summary of around 10 words specifying whether the sentiment of each review is Negative, Neutral, or Positive, and briefly explaining the reason why.</instruction>
<instruction>Provide a sentiment score between -1 and 1 for each review. Negative values indicate negative sentiment, positive values indicate positive sentiment, and 0 is neutral.</instruction>
</instructions>

<example-output>
<review>
    <number>1</number>
    <summary>Neutral - Mixed feelings about visibility and setup issues.</summary>
    <score>0.0</score>
</review>
</example-output>

<reviews>
{mycontext}
</reviews>
"""

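
Two practical details for an XML prompt like the one above, as a sketch: escaping the reviews so characters like `<` or `&` cannot break the structure, and parsing a reply that follows the `<example-output>` format. The review text and the model reply below are made up; a real reply may need extra cleaning before it parses.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Escape characters such as < and & so a review cannot break the prompt's XML structure
raw_reviews = ["Battery died after 2 days & the screen is <tiny>."]  # made-up review
safe_context = "\n".join(f"<review>{escape(r)}</review>" for r in raw_reviews)
print(safe_context)

# Made-up model reply in the format requested by <example-output>
reply = """
<review>
    <number>1</number>
    <summary>Negative - Short battery life and a small screen.</summary>
    <score>-0.7</score>
</review>
"""

# Wrap in a single root element, because the reply may contain several <review> siblings
root = ET.fromstring(f"<reviews>{reply}</reviews>")
parsed = [
    {
        "number": int(r.findtext("number")),
        "summary": r.findtext("summary"),
        "score": float(r.findtext("score")),
    }
    for r in root.findall("review")
]
print(parsed)
```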