<a href="https://colab.research.google.com/github/wingated/cs180/blob/main/data_science_labs/data_science_lab_9.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a><p><b>After clicking the "Open in Colab" link, copy the notebook to your own Google Drive before getting started, or it will not save your work</b></p>

# BYU CS 180 Lab 9: Gen AI for Data Science

## Introduction:
In this lab, you will practice using generative AI as part of a data science workflow.

For this project, you will use a language model to analyze unstructured text and visualize the results. The text comes from transcripts of news coverage from April 2020, during the COVID-19 outbreak. There are 1,000 transcripts in total, and each is a transcript of a single interview. (This is a subset of the MediaSum dataset, https://github.com/zcgzcgzcg1/MediaSum.)

Each interview consists of the text of the transcript, plus metadata about the transcript (such as the date and the program it was broadcast on).

For our purposes, we are mostly interested in the "utt" field of the data.

# Exercise 1: downloading and loading

### Step 1: download and load the data

Begin by downloading the data from the class website:

In [None]:
!wget https://wingated.github.io/cs180/covid1k.json

and loading it:

In [None]:
import json

# Use a context manager so the file is closed once the data is loaded.
with open("covid1k.json", "r") as f:
    data = json.load(f)
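
Before writing any prompts, it can help to look at one record and turn its "utt" field into a single block of text. The sketch below is only illustrative: it assumes each record is a dict, that "utt" is a list of utterance strings, and that a parallel "speaker" list may be present (an assumption based on the MediaSum layout, not something this lab states). The `sample` record and the `transcript_text` helper are hypothetical.

```python
# Hypothetical record mimicking the assumed layout; real records come from covid1k.json.
sample = {
    "program": "EXAMPLE PROGRAM",
    "date": "2020-04-01",
    "speaker": ["HOST", "GUEST"],
    "utt": ["Welcome to the show.", "Thanks for having me."],
}

def transcript_text(record):
    """Join an interview's utterances into one string, prefixing speakers when available."""
    utts = record["utt"]
    if isinstance(utts, str):  # already a single string, nothing to join
        return utts
    speakers = record.get("speaker") or []
    if len(speakers) == len(utts):
        # One speaker label per utterance: keep them, since they help identify host and guest.
        return "\n".join(f"{s}: {u}" for s, u in zip(speakers, utts))
    return "\n".join(utts)

print(sorted(sample.keys()))    # which fields does a record have?
print(transcript_text(sample))  # the text that would go into a prompt
```

On the real data, `data[0].keys()` shows which fields are actually available, so check that before relying on the assumed names.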

# Exercise 2: extraction

For each transcript, your task is to figure out who the interviewer is, who the guest is, and whether or not the guest is a medical doctor.

This task is easy for humans, but is hard for traditional NLP methods such as keyword analysis, parsing, part-of-speech tagging, etc.

To do this, we will use generative AI, specifically a large language model. Here is an example of using the OpenAI Chat Completions API to send a prompt to a model and return the completion it generates:

In [None]:
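from openai import OpenAI

# Requires openai>=1.0; the older openai.ChatCompletion interface was removed in that release.
client = OpenAI(api_key="XXX PUT YOUR API KEY HERE XXX")

def do_query( messages, max_tokens=512, temperature=1.0, model="gpt-4o-mini" ):
    """Send a list of chat messages to the model and return the text of its reply."""
    response = client.chat.completions.create(
        messages=messages,
        model=model,
        max_tokens=max_tokens,
        temperature=temperature,
        )

    return response.choices[0].message.content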

prompt = "When asked whether Coke or Pepsi is better, I respond that"

messages = [
    {"role": "system", "content": prompt },
]

response = do_query( messages )

print( response )


You must do the following:

* Create a prompt template
* For each transcript
  * Create a prompt from the template
  * Run the prompt through the language model
* Store the results in an appropriate data structure
  * You may want to store additional metadata about the interview, to support visualizations later on!
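
One possible shape for these steps is sketched below. It is a starting point under stated assumptions, not the intended solution: the template wording, the `build_messages` and `run_all` helpers, the `max_chars` truncation, and the "raw"/"program"/"date" keys of each stored result are all hypothetical, and "utt" is assumed to be a list of strings. The query function is passed in as an argument so the loop can be tried with a stand-in before spending API calls.

```python
# Note: the template contains no literal braces other than {transcript},
# because str.format would try to interpret them as placeholders.
PROMPT_TEMPLATE = """You will be given a transcript of a news interview.
Identify the interviewer, the guest, and whether the guest is a medical doctor.
Respond with only a JSON object with the keys "interviewer", "guest", and "is_doctor" (true or false).

Transcript:
{transcript}
"""

def build_messages(transcript, max_chars=6000):
    """Fill the template and wrap it in the list-of-messages format used by do_query."""
    # Truncation is an assumption to keep long transcripts from producing huge prompts.
    prompt = PROMPT_TEMPLATE.format(transcript=transcript[:max_chars])
    return [{"role": "user", "content": prompt}]

def run_all(records, query_fn, limit=None):
    """Run every record (or the first `limit`) through query_fn and keep the raw replies."""
    results = []
    for record in records[:limit]:
        text = "\n".join(record["utt"])  # assumes "utt" is a list of strings
        raw = query_fn(build_messages(text))
        # Keep metadata alongside the reply to support visualizations later.
        results.append({
            "raw": raw,
            "program": record.get("program"),
            "date": record.get("date"),
        })
    return results
```

With the real data this might be called as `run_all(data, do_query, limit=10)` while experimenting, and without `limit` once the template works.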

Then, you must:
* For each result:
  * Parse the result and determine the interviewer, the guest, and whether or not the guest is a medical doctor.
* Create a dataframe with the results
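
A minimal sketch of the parsing step follows, assuming the model was asked to reply in JSON and that each stored result is a dict with the hypothetical keys "raw" (the model's reply), "program", and "date". The `parse_result` helper and the JSON key names are illustrative choices, not part of the lab.

```python
import json

def parse_result(result):
    """Turn one stored result into a flat row, or None if the reply is not a JSON object."""
    try:
        answer = json.loads(result["raw"])
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(answer, dict):
        return None
    is_doctor = answer.get("is_doctor")
    if isinstance(is_doctor, str):
        # Models sometimes answer "yes" or "True" as a string instead of a JSON boolean.
        is_doctor = is_doctor.strip().lower() in ("true", "yes")
    return {
        "interviewer": answer.get("interviewer"),
        "guest": answer.get("guest"),
        "is_doctor": bool(is_doctor),
        "program": result.get("program"),
        "date": result.get("date"),
    }
```

The rows that are not `None` can then be passed to `pd.DataFrame(rows)`; counting the `None` rows is a quick way to see how often the prompt template fails.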

When all is said and done, you should have a dataframe that looks something like this:

In [None]:
       interviewer              guest  is_doctor                      program        date
0         Anderson   Dr. Sanjay Gupta       True  ANDERSON COOPER 360 DEGREES  2020-04-01
1         Anderson  Dr. Craig Spencer       True  ANDERSON COOPER 360 DEGREES  2020-04-01
2         Anderson   Gretchen Whitmer      False  ANDERSON COOPER 360 DEGREES  2020-04-01
3        Carl Azuz   Dr. Sanjay Gupta       True                       CNN 10  2020-04-01
4       John Vause      Dr. Raj Kalsi       True                 CNN NEWSROOM  2020-04-01
5             John       Steven Jiang      False                 CNN NEWSROOM  2020-04-01
6       Jim Acosta  Dr. Anthony Fauci       True                 CNN NEWSROOM  2020-04-01
7             John         Vedika Sud      False                 CNN NEWSROOM  2020-04-01
8       John Vause       Amanda Davis      False                 CNN NEWSROOM  2020-04-01
9  Rosemary Church   Anthony Costello       True                 CNN NEWSROOM  2020-04-01
...

**Hints:**

You may need to try multiple prompt templates to get one that works. Experiment on a subset of the data (maybe only 10 interviews) until you get something working.

Make the language model do the work! You can ask it to output its answers to your questions in a structured format that's easy to parse.

Major hint: I asked the language model to output its results in JSON format, and it worked almost flawlessly.  You may need to do some post-processing.
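
As one example of such post-processing, a model may wrap its JSON in a markdown code fence or add a sentence before or after it. The sketch below handles that case by parsing only the span from the first `{` to the last `}`. The `extract_json` helper is hypothetical, and the approach assumes the reply contains a single JSON object.

```python
import json

def extract_json(text):
    """Parse the span from the first '{' to the last '}' as JSON; return None on failure."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None

# A reply wrapped in a markdown fence, built with "`" * 3 to avoid a literal fence here.
fence = "`" * 3
reply = fence + 'json\n{"guest": "Dr. Sanjay Gupta", "is_doctor": true}\n' + fence
print(extract_json(reply))
```

Replies that still fail to parse are worth printing and reading by hand, since they usually point to a fix in the prompt template.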

# Exercise 3: visualizations

With our newly processed data in hand, craft three interesting visualizations. These can be anything you want: for example, the percentage of interviews that featured a doctor on CNN vs. NPR, a per-host breakdown of guests, or a trend over time.
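
Most visualizations of this kind start from a small aggregation. The sketch below shows two, using pandas and a tiny hypothetical dataframe with the same column names as the target dataframe. The program names and values are made up for illustration; real rows come from Exercise 2.

```python
import pandas as pd

# Hypothetical rows in the shape of the target dataframe.
df = pd.DataFrame({
    "program": ["SHOW A", "SHOW A", "SHOW B", "SHOW B"],
    "is_doctor": [True, False, True, True],
    "date": ["2020-04-01", "2020-04-01", "2020-04-01", "2020-04-02"],
})

# The mean of a boolean column is the fraction of True values,
# so this gives the percentage of interviews with a doctor per program.
doctor_share = df.groupby("program")["is_doctor"].mean() * 100

# Interviews per day; ISO-formatted date strings sort chronologically.
per_day = df["date"].value_counts().sort_index()

print(doctor_share)
print(per_day)
```

Either result is a pandas Series, so `doctor_share.plot.bar()` or `per_day.plot()` would draw it, assuming matplotlib is installed.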

In [None]:
# your code here