<a href="https://colab.research.google.com/github/yangyuwang/painting_style_prediction/blob/main/wikipedia_demographic_extract.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Load Wikipedia texts

In [1]:
from google.colab import drive
drive.mount('/content/drive/')


Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [5]:
import os
import re
from tqdm import tqdm

artist_wikipedia_text_example = []

path = '/content/drive/My Drive/artist_wikipedia_content/'

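# List the directory once so that names and texts stay aligned.
files = os.listdir(path)

# Strip only the trailing ".txt" extension from each file name.
target_name = [re.sub(r"\.txt$", "", file) for file in files]

for file in tqdm(files):
  with open(path + file, "r", encoding="utf-8") as f:
    artist_wikipedia_text_example.append(re.sub("\n", "", f.read()))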

100%|██████████| 2752/2752 [00:04<00:00, 569.52it/s]


# OpenAI settings

In [6]:
pip install openai==0.28



In [7]:
import getpass
OPENAI_API_KEY = getpass.getpass()

··········


# Extraction

In [29]:
import openai
import json
import pandas as pd

# Set your OpenAI API key
openai.api_key = OPENAI_API_KEY

def extract_demographics_few_shot(text):

    # Few-shot example using Agostino Carracci's information.
    few_shot_example = """
                        Example:
                        Text: "Agostino Carracci (or Caracci; 16 August 1557 – 22 March 1602) was an Italian painter born in Bologna. Initially, he trained as a goldsmith and then studied painting with Prospero Fontana and Bartolomeo Passarotti. He traveled to Venice to train as an engraver under Cornelis Cort."
                        Chain-of-thought:
                        1. Educational Level: Agostino was taught by established artists (Prospero Fontana and Bartolomeo Passarotti), so his educational level is "Taught by other artists".
                        2. Immigration Status: He was born in Bologna (Italy-Bologna) and later traveled to Venice (Italy-Venice). This migration is domestic, so "Domestic Migration" is 1 and "International Migration" is 0.
                        3. Gender: The text indicates he is male.
                        4. Age: He was born in 1557 and died in 1602, giving him an approximate age of 45 years.
                        Final JSON Output:
                        {
                          "Educational Level": "Taught by other artists",
                          "Immigration Status": {
                            "Origin": "Italy-Bologna",
                            "Destination": "Italy-Venice",
                            "Domestic Migration": 1,
                            "International Migration": 0
                          },
                          "Gender": "Male",
                          "Age": "45"
                        }
                        ---
                        """

    # Build the full prompt including the few-shot example.
    prompt = f"""
              You are an expert text analyst. Below is a few-shot example showing how to extract demographic attributes from a Wikipedia-style text about an artist using a chain-of-thought approach.

              {few_shot_example}

              Now, analyze the following text and extract the following demographic attributes:
              - Educational Level: Choose one from ["No formal education", "Elementary level", "Middle school level", "High school level", "College level", "Master level", "PhD level", "Taught by other artists"].
              - Immigration Status: Provide a dictionary with:
                  - "Origin" (in "Country-City" format),
                  - "Destination" (in "Country-City" format),
                  - "Domestic Migration" (1 if the migration is within the same country, 0 otherwise),
                  - "International Migration" (1 if the migration is between different countries, 0 otherwise).
              - Gender
              - Age: Calculate approximate age if birth and death or other relevant dates are given.

              Please follow these instructions:
              1. First, provide your chain-of-thought reasoning step-by-step.
              2. Then, based on your reasoning, output a final answer in JSON format with the keys "Educational Level", "Immigration Status", "Gender", and "Age".
              3. The final answer must be a single valid JSON object, with no code fences, markdown, or text after it.

              Text:
              \"\"\"{text}\"\"\"

              Answer:
              """
    # Call the GPT-3.5 model
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that provides detailed chain-of-thought reasoning."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.0
    )

    # Get the content of the reply
    answer = response["choices"][0]["message"]["content"]

    # Prefer a fenced JSON block if the model produced one.
    json_match = re.search(r"```json\s*(\{.*\})\s*```", answer, re.DOTALL)

    if json_match:
        json_str = json_match.group(1)
    else:
        # Fallback: take everything from the first '{' to the last '}'.
        json_start = answer.find('{')
        json_end = answer.rfind('}')
        if json_start == -1 or json_end < json_start:
            print("No JSON found in the response. Full response saved")
            return {}, answer
        json_str = answer[json_start:json_end + 1]

    try:
        result = json.loads(json_str)
        return result, json_str
    except json.JSONDecodeError:
        print("JSON decode error. Full response saved")
        return {}, json_str
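
Because the prompt asks for chain-of-thought reasoning before the JSON, the reply can contain stray braces ahead of the final object. The sketch below is one possible alternative to the first-brace fallback, not part of the original pipeline: it scans the reply with the standard library's `json.JSONDecoder.raw_decode` and keeps the last object that parses. The helper name `extract_last_json_object` is hypothetical.

```python
import json

def extract_last_json_object(answer):
    """Return the last top-level JSON object found in `answer`, or None."""
    decoder = json.JSONDecoder()
    last = None
    pos = 0
    while True:
        start = answer.find("{", pos)
        if start == -1:
            return last
        try:
            # raw_decode parses one JSON value starting at `start`
            # and returns it together with the index where it ended.
            obj, end = decoder.raw_decode(answer, start)
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            pos = start + 1
            continue
        if isinstance(obj, dict):
            last = obj
        pos = end

# Illustrative reply: a stray brace in the reasoning, then the final object.
reply = 'Reasoning: born {in} 1557.\nFinal JSON Output:\n{"Gender": "Male", "Age": "45"}\nDone.'
parsed = extract_last_json_object(reply)
```

Trailing text after the object is ignored, so a reply that ends with a closing remark still yields the JSON.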


In [31]:
# Example

# Wikipedia texts for Agostino Carracci and Agnolo Bronzino
text_carracci = """
Agostino Carracci (or Caracci; Italian pronunciation: [aɡoˈstiːno karˈrattʃi]; 16 August 1557 – 22 March 1602) was an Italian painter, printmaker, tapestry designer, and art teacher. He was, together with his brother, Annibale Carracci, and cousin, Ludovico Carracci, one of the founders of the Accademia degli Incamminati (Academy of the Progressives) in Bologna. Intended to devise alternatives to the Mannerist style favored in the preceding decades, this teaching academy helped propel painters of the School of Bologna to prominence.

Agostino Carracci was born in Bologna as the son of a tailor. He was the elder brother of Annibale Carracci and the cousin of Ludovico Carracci. He initially trained as a goldsmith. He later studied painting, first with Prospero Fontana, who had been Lodovico's master, and later with Bartolomeo Passarotti. He traveled to Parma to study the works of Correggio. Accompanied by his brother Annibale, he spent a long time in Venice, where he trained as an engraver under the renowned Cornelis Cort.
"""

text_bronzino = """
Agnolo di Cosimo (Italian: [ˈaɲɲolo di ˈkɔːzimo]; 17 November 1503 – 23 November 1572), usually known as Bronzino or Agnolo Bronzino, was an Italian Mannerist painter from Florence. His sobriquet, Bronzino, may refer to his relatively dark skin or reddish hair.
He lived all his life in Florence, and from his late 30s was kept busy as the court painter of Cosimo I de' Medici, Grand Duke of Tuscany. He was mainly a portraitist, but also painted many religious subjects and a few allegorical subjects. He trained with Pontormo, the leading Florentine painter of the first generation of Mannerism, and his style was greatly influenced by him. He was apprenticed at 14 under Pontormo and later studied with Raffaellino del Garbo.
"""

# Process each text to extract demographics using the few-shot approach
results = {}
results["Agostino Carracci"], str_1 = extract_demographics_few_shot(text_carracci)
results["Agnolo Bronzino"], str_2 = extract_demographics_few_shot(text_bronzino)

# Convert the results into a pandas DataFrame
rows = []
for name, data in results.items():
    data["Name"] = name
    rows.append(data)
df = pd.DataFrame(rows)

# Display the DataFrame
print(df)

         Educational Level                                 Immigration Status  \
0  Taught by other artists  {'Origin': 'Italy-Bologna', 'Destination': 'It...   
1  Taught by other artists  {'Origin': 'Italy-Florence', 'Destination': 'I...   

  Gender Age               Name  
0   Male  45  Agostino Carracci  
1   Male  69    Agnolo Bronzino  
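
The "Immigration Status" column above holds a nested dictionary, which is awkward to filter or aggregate. A minimal sketch of flattening it one level before building the DataFrame follows. The helper `flatten_record` and the `.` separator are assumptions made for illustration; the sample record is the few-shot example from the prompt. pandas also provides `pd.json_normalize` for the same purpose.

```python
def flatten_record(record, sep="."):
    """Flatten one level of nested dicts into 'Parent.Child' keys."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict):
            for sub_key, sub_value in value.items():
                flat[f"{key}{sep}{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat

# The few-shot example from the prompt, used here as sample input.
record = {
    "Educational Level": "Taught by other artists",
    "Immigration Status": {
        "Origin": "Italy-Bologna",
        "Destination": "Italy-Venice",
        "Domestic Migration": 1,
        "International Migration": 0,
    },
    "Gender": "Male",
    "Age": "45",
}
flat = flatten_record(record)
```

A list of such flat dictionaries can be passed to `pd.DataFrame` in the same way as `rows` above, giving one column per migration field.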


In [34]:
len(demographic)

180

In [60]:
def summarize_text(text, max_words=200):
    prompt = f"Summarize the following text in {max_words} words while keeping all relevant biographical and demographic details:\n\n{text}"

    response = openai.ChatCompletion.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.5
    )

    return response["choices"][0]["message"]["content"]

In [61]:
# Uncomment to start from scratch; `demographic` must already exist for this cell to run.
# demographic = {}

for i in tqdm(range(len(target_name))):
    if target_name[i] not in demographic:
        try:
            attributes, str_json = extract_demographics_few_shot(artist_wikipedia_text_example[i])
        except Exception as e:
            # Typically a context-length error: summarize the text, then retry once.
            print(e, "Too long... Summarize first")
            attributes, str_json = extract_demographics_few_shot(summarize_text(artist_wikipedia_text_example[i]))
        attributes["str"] = str_json
        demographic[target_name[i]] = attributes

 44%|████▎     | 1203/2752 [00:22<00:41, 37.25it/s]

This model's maximum context length is 16385 tokens. However, your messages resulted in 17696 tokens. Please reduce the length of the messages. Too long... Summarize first


 47%|████▋     | 1290/2752 [03:27<41:18,  1.70s/it]

This model's maximum context length is 16385 tokens. However, your messages resulted in 16558 tokens. Please reduce the length of the messages. Too long... Summarize first


 48%|████▊     | 1316/2752 [04:36<57:25,  2.40s/it]

This model's maximum context length is 16385 tokens. However, your messages resulted in 17038 tokens. Please reduce the length of the messages. Too long... Summarize first


 60%|██████    | 1657/2752 [15:31<40:40,  2.23s/it]

JSON decode error. Full response saved


 60%|██████    | 1663/2752 [15:41<37:48,  2.08s/it]

JSON decode error. Full response saved


 71%|███████   | 1947/2752 [25:21<31:31,  2.35s/it]

JSON decode error. Full response saved


 77%|███████▋  | 2131/2752 [30:59<21:33,  2.08s/it]

This model's maximum context length is 16385 tokens. However, your messages resulted in 21944 tokens. Please reduce the length of the messages. Too long... Summarize first


 79%|███████▉  | 2182/2752 [32:52<13:16,  1.40s/it]

This model's maximum context length is 16385 tokens. However, your messages resulted in 20425 tokens. Please reduce the length of the messages. Too long... Summarize first


 85%|████████▍ | 2327/2752 [37:48<11:28,  1.62s/it]

This model's maximum context length is 16385 tokens. However, your messages resulted in 23115 tokens. Please reduce the length of the messages. Too long... Summarize first


 92%|█████████▏| 2544/2752 [45:45<08:25,  2.43s/it]

JSON decode error. Full response saved


 96%|█████████▌| 2648/2752 [49:19<03:39,  2.11s/it]

This model's maximum context length is 16385 tokens. However, your messages resulted in 16715 tokens. Please reduce the length of the messages. Too long... Summarize first


100%|██████████| 2752/2752 [52:55<00:00,  1.15s/it]
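
The context-length errors in the log only surface after a failed API call. A rough pre-check could route long texts to `summarize_text` before the first request. The sketch below is a heuristic and not how the API counts tokens: the limit of 16385 comes from the error messages above, while the four-characters-per-token ratio and the overhead figures are assumptions that would need tuning.

```python
CONTEXT_LIMIT = 16385   # taken from the error messages in the log
CHARS_PER_TOKEN = 4     # assumption: rough rule of thumb for English text

def estimate_tokens(text):
    """Very rough token estimate based on character count."""
    return len(text) // CHARS_PER_TOKEN + 1

def fits_context(text, prompt_overhead_tokens=1000, reply_tokens=500):
    """Guess whether text plus prompt and reply stay under the limit.

    prompt_overhead_tokens and reply_tokens are hypothetical budgets for
    the few-shot prompt and the model's answer.
    """
    total = estimate_tokens(text) + prompt_overhead_tokens + reply_tokens
    return total <= CONTEXT_LIMIT

short_ok = fits_context("a" * 4000)
long_ok = fits_context("a" * 80000)
```

Texts for which `fits_context` returns `False` could be summarized first, with the `try`/`except` kept as a safety net because the estimate can be wrong in either direction.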


In [62]:
rows = []
for name, data in demographic.items():
    data["Name"] = name
    rows.append(data)

df = pd.DataFrame(rows)

df.head()

Unnamed: 0,Educational Level,Immigration Status,Gender,Age,str,Name
0,College level,"{'Origin': 'USA-New York City', 'Destination':...",Female,82,"{\n ""Educational Level"": ""College level"",\n ...",betty-parsons
1,Master level,"{'Origin': 'USA-Chicago', 'Destination': 'Fran...",Female,67,"{\n ""Educational Level"": ""Master level"",\n ""...",joan-mitchell
2,College level,"{'Origin': 'USA-Syracuse', 'Destination': 'USA...",Male,92,"{\n ""Educational Level"": ""College level"",\n ...",robert-goodnough
3,College level,"{'Origin': 'USA-Port Arthur', 'Destination': '...",Male,82,"{\n ""Educational Level"": ""College level"",\n ...",robert-rauschenberg
4,College level,"{'Origin': 'Spain-Turégano', 'Destination': 'U...",Male,97,"{\n ""Educational Level"": ""College level"",\n ...",esteban-vicente


In [63]:
df.to_csv('/content/drive/My Drive/artist_demographic_wikipedia.csv')

In [73]:
df.loc[df["Educational Level"].isna(), "Name"]

Unnamed: 0,Name
55,moshe-kupferman
64,john-mccracken
89,alexander-liberman
157,thomas-downing
178,matsutani
187,ian-davenport
250,jean-le-moal
451,daniel-richter
469,takis
486,turi-simeti
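
The names listed above are the artists whose extraction returned no usable JSON. One way to collect them for a second pass is sketched below. The helper `names_to_retry` and the sample dictionary are hypothetical; the required keys are the four named in the prompt.

```python
REQUIRED_KEYS = ("Educational Level", "Immigration Status", "Gender", "Age")

def names_to_retry(demographic):
    """Return names whose record lacks any of the required attributes."""
    return [
        name
        for name, attrs in demographic.items()
        if any(attrs.get(key) is None for key in REQUIRED_KEYS)
    ]

# Hypothetical records: the second one mimics a failed extraction,
# where only the raw response string was stored.
sample = {
    "artist-a": {
        "Educational Level": "College level",
        "Immigration Status": {"Origin": "X-Y", "Destination": "X-Z"},
        "Gender": "Female",
        "Age": "70",
        "str": "{...}",
    },
    "artist-b": {"str": "unparseable reply"},
}
retry = names_to_retry(sample)
```

Deleting those names from `demographic` and rerunning the extraction loop would retry only the failed artists, since the loop skips names that are already present.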
