<a href="https://colab.research.google.com/github/vanderbilt-data-science/ai_summer/blob/main/0_p4generative_ai_essentials.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Python for Generative AI - The Essentials
> Day 1: Introduction to Python

Based on [vanderbilt-data-science/p4ai-essentials](https://github.com/vanderbilt-data-science/p4ai-essentials)

In today's workshop, using Generative AI as a coding assistant, you'll get:
- An introduction to Google Colab
- An introduction to Python as a computational engine
- The foundations of Python: variables, data elements, and data structures

# The Problem
You have a set of emails that you'd like to parse into their individual components. You'd like a section for the header which will allow you to look up any individual field value (e.g., if you want to look up the `Subject` of this email, you will expect to get `INFO NEEDED: Gaucher's Disease`). Then, you want to be able to access any individual email in the set as well. An example email is as follows:

```
Newsgroups: sci.med
Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!darwin.sura.net!sgiblab!sdd.hp.com!decwrl!decwrl!uunet!utcsri!utnut!utzoo!telly!problem!intacc!bed
From: bed@intacc.uucp (Deb Waddington)
Subject: INFO NEEDED: Gaucher's Disease
Message-ID: <1993Mar18.002149.1111@intacc.uucp>
Date: Thu, 18 Mar 1993 00:21:49 GMT
Distribution: Everywhere
Expires: 01 Jun 93
Reply-To: bed@intacc.UUCP (Deb Waddington)
Organization: Matrix Artists' Network
Lines: 33


I have a 42 yr old male friend, misdiagnosed as having
 osteopporosis for two years, who recently found out that his
 illness is the rare Gaucher's disease. 

Gaucher's disease symptoms include: brittle bones (he lost 9 
 inches off his hieght); enlarged liver and spleen; internal
 bleeding; and fatigue (all the time). The problem (in Type 1) is
 attributed to a genetic mutation where there is a lack of the
 enzyme glucocerebroside in macrophages so the cells swell up.
 This will eventually cause death.

Enyzme replacement therapy has been successfully developed and
 approved by the FDA in the last few years so that those patients
 administered with this drug (called Ceredase) report a remarkable
 improvement in their condition. Ceredase, which is manufactured
 by biotech biggy company--Genzyme--costs the patient $380,000
 per year. Gaucher's disease has justifyably been called "the most
 expensive disease in the world".

NEED INFO:
I have researched Gaucher's disease at the library but am relying
 on netlanders to provide me with any additional information:
**news, stories, reports
**people you know with this disease
**ideas, articles about Genzyme Corp, how to get a hold of
   enough money to buy some, programs available to help with
   costs.
**Basically ANY HELP YOU CAN OFFER

Thanks so very much!

Deborah 

```

## Breakout Room 1: Conceptual Solution (5 mins)
Assume that you're instructing an assistant to do this. You have all of the emails and an unlimited supply of interns, which you can think of as one super-intern who has plenty of time but is unfamiliar with data in general. You also have the tools you commonly use (e.g., Excel, Word, or whatever makes sense to you). Write a set of instructions telling your intern how to extract and organize this data.



---



# A Programmatic Solution in Python

We can use Python and Google Colab!

# Introduction to Google Colab
Google Colab facilitates literate programming, meaning that it's not just code and code comments. We can narrate, as you see here! Double click this cell to inspect what markdown notation looks like!

In [None]:
# This is a code cell. This line is a comment. To execute this cell (provide information and/or instruction to our kernel "child"),
# use the arrow on the LHS of the cell. Click it now!

That's right, nothing of importance happened. We didn't give it any information or instruction. It essentially just received a blank fax. We'll come back to this.

Let's look at some other tools provided by Google Colab before we get into the Python Language:
* Hovering over the bottom of the cell provides options to create a new code cell or a new text/markdown cell
* Shift + Enter executes the cell (without having to move your hands to the mouse/touchpad to click run)
* If you ever want to run cells in a particular way (for example, all cells at once rather than one at a time), check out the `Runtime` menu.

There is also some functionality we won't talk about here, but message us and we'll make a separate session given enough demand. For example:
* Mounting Google Drive (allowing you access to your cloud/shared data/documents)
* Using scratchpads
* Connecting to a local Jupyter runtime (Colab is in the cloud on far away machines; you could also connect it to your lab machine instead!)

# The Python Kernel


## Things Python Already Knows: Calculator Functionality
Let's explore "providing instructions" and "providing information". As we said, there are things that the intern "knows of" and "knows how to do" already. Let's look at the most important ones in Python.

### Data types
Python already knows a specific set of basic data types and knows how to work with them in ways you'd mostly expect. Let's check some out.

In [None]:
# It gets what numbers are and common things we do with them
print(7 + 7)

# It can work with text/string data
print('The data is ' + 'long')

# It can automatically navigate (some) different data types
print(2.8 + 9)

Working only with raw, hard-coded numbers quickly hits diminishing returns. Programming is built on increasing levels of abstraction, and the first step is assigning values to variables.

In [None]:
# Numbers
var_integer = 7
var_float = 0.22

# Strings
var_string = 'the dog is cute'

Just writing this code does nothing on its own. How does it get executed? If we write the code but don't execute it, it's like speaking instructions to a child who isn't even listening. Check this out.

Before executing the cells above, execute the following cell:

In [None]:
dir()

These are some of the variables that are already populated and "saved" in the kernel. Let's see what happens when we actually execute the cells.

Press the arrow next to the cell which defines some of our different variable types. Then, run the following cell.

In [None]:
dir()

Notice the new information in your kernel! These variables have been "saved" into your kernel and are now new pieces of information to be referenced. This is the fundamental operation of the kernel. Keep in mind two things:

1. Notice that the execution/output of the first `dir()` code cell did not change. This reflects the outputs and the state of the kernel at the time that you ran the cell.
2. Now, we can use these _objects_ that we've saved!

One way of "using" these objects is just to see their values. Let's check out what this looks like:

In [None]:
var_float

### Basic Math Operations
Beyond the data types and what they mean, Python already knows several types of math/set operations. Let's check them out. A great reference can be found in the [Python Standard Library reference](https://docs.python.org/3/library/stdtypes.html#numeric-types-int-float-complex).

In [None]:
# Add together our float and our integer value
var_float + var_integer

In [None]:
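# Subtract var_integer from var_float
print(var_float - var_integer)

# Multiply var_integer with var_float
print(var_float * var_integer)

# Quotient of var_float divided by var_integer
# (only the last bare expression in a cell is displayed, so the earlier results are printed)
var_float / var_integer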

### Basic Comparisons
We can see similar built-in behavior with comparisons. Let's quickly also check those out. We will again use the [Python library reference on Comparisons here.](https://docs.python.org/3/library/stdtypes.html#comparisons)

In [None]:
#Determine if one value is larger than another
var_float > var_integer

In [None]:
#Determine whether a numerical object is equal to another
print(var_float == var_integer)

#Determine inequality
var_float != var_integer

# Breakout Room 2: Identifying foundational elements of Python for our Email Example

Assume you have the following code to meet your goals for the preceding example. Using ChatGPT, answer the questions that follow.

```
def process_document(document_string):
    lines = document_string.split('\n')
    
    header_dict = {}
    email_body = []
    
    # Flag to know when the header ends and the body starts
    body_flag = False
    
    for line in lines:
        if not body_flag:
            # Header lines processing
            if ':' in line:
                key, value = line.split(':', 1)
                header_dict[key.strip()] = value.strip()

                if 'Lines' in key:
                    body_flag = True
            else:
                # Header lines without ':' are considered part of the last field
                header_dict[key] += '\n' + line.strip()
                
        else:
            # Body lines processing
            email_body.append(line)
            
    return header_dict, email_body

# Read in the document
with open('document.txt', 'r') as file:
    document_string = file.read()

header, body = process_document(document_string)

print(header)
print(body)

```

**Questions to answer in your breakout rooms using ChatGPT:**
1. What are the two major components of this code?
2. What are some examples of variables used in the code? What do they represent, and what is their purpose? 
3. What are the data structures used in this code? What is the purpose of both data structures (generally)?
4. What are the conditional execution statements used in this example? What do they do?
5. What functions are used in this example?
6. (Reach) What does each grouped section of code do?
7. (Reach) Can you see any pitfalls in this approach?
8. (Reach) What other questions do you have?


# A decomposition of code elements
Let's decompose this code to execute and explore its elements.

## Accessing and downloading data

There are _many_ ways of reading data into Google Colab. You can access data directly:
1. Through file paths to data stored on your Google Drive
2. Through the Google Drive API
3. Programmatically if the data is stored on the web

We will leverage the last approach here, and use the [20 Newsgroups](http://qwone.com/~jason/20Newsgroups/) dataset. We will use the _command line functionality_ through `curl` to download this data.

In [None]:
# download the data using command line tools
!curl -O http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 16.5M  100 16.5M    0     0  13.6M      0  0:00:01  0:00:01 --:--:-- 13.6M


In [None]:
# Untar the data to the current directory
# note that we can use %%capture if we don't want to see all of the output
!tar -zxvf 20news-19997.tar.gz

## Reading in the data to the Python kernel

In [None]:
# Identify document


# Read in the document
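
A minimal sketch of what this cell might contain once it is filled in. It assumes the archive unpacked into a `20_newsgroups/` folder with one subfolder per newsgroup (check the Colab file browser to confirm), and it simply takes the first file in `sci.med`. The helper name `read_document` is our own and is not part of the workshop code.

```python
from pathlib import Path


def read_document(path):
    """Return the full text of the file at `path` as one string."""
    # latin-1 maps every byte to a character, so older posts that
    # contain non-ASCII bytes will not raise a decoding error.
    return Path(path).read_text(encoding="latin-1")


# Assumed location after untarring; adjust to wherever the files landed.
newsgroup_dir = Path("20_newsgroups") / "sci.med"

# Identify document: list the files in the folder, sorted by name
files = sorted(newsgroup_dir.iterdir()) if newsgroup_dir.is_dir() else []

if files:
    # Read in the document: take the first file as our example
    document_string = read_document(files[0])
    print(document_string[:200])
else:
    print("No files found in", newsgroup_dir)
```

The resulting `document_string` is the single string that `process_document` expects as its input.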



## Add the rest of the code

In [None]:
# The function for processing the document
def process_document(document_string):
    lines = document_string.split('\n')
    
    header_dict = {}
    email_body = []
    
    # Flag to know when the header ends and the body starts
    body_flag = False
    
    for line in lines:
        if not body_flag:
            # Header lines processing
            if ':' in line:
                key, value = line.split(':', 1)
                header_dict[key.strip()] = value.strip()

                if 'Lines' in key:
                    body_flag = True
            else:
                # Header lines without ':' are considered part of the last field
                header_dict[key] += '\n' + line.strip()
                
        else:
            # Body lines processing
            email_body.append(line)
            
    return header_dict, email_body

In [None]:
# The actual execution of the processing of the document
header, body = process_document(document_string)

print(header)
print(body)
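
If you don't have a file loaded yet, you can still try the function on a tiny hand-written message. The sketch below repeats `process_document` as defined above so that it runs on its own; the sample text is made up for illustration and is much shorter than a real post.

```python
def process_document(document_string):
    lines = document_string.split('\n')

    header_dict = {}
    email_body = []

    # Flag to know when the header ends and the body starts
    body_flag = False

    for line in lines:
        if not body_flag:
            if ':' in line:
                # Split only on the first colon so values may contain colons
                key, value = line.split(':', 1)
                header_dict[key.strip()] = value.strip()

                if 'Lines' in key:
                    body_flag = True
            else:
                # Header lines without ':' are considered part of the last field
                header_dict[key] += '\n' + line.strip()
        else:
            email_body.append(line)

    return header_dict, email_body


# A made-up, shortened message: three header fields, a blank line, two body lines
sample = (
    "From: bed@intacc.uucp (Deb Waddington)\n"
    "Subject: INFO NEEDED: Gaucher's Disease\n"
    "Lines: 2\n"
    "\n"
    "First body line\n"
    "Second body line"
)

header, body = process_document(sample)

print(header['Subject'])  # the value keeps its own colon
print(body)               # a list with one string per body line
```

Notice that the blank line separating the header from the body ends up as the first element of `body`, because everything after the `Lines` field is treated as body text.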

# Things Python Already Knows: Aggregated Data Elements (Collections)

Beyond single variables, we often want to treat related values as components of a single variable, called a _collection_. What the heck does that mean? Let's check out two built-in data structures.


## Lists
Think of the hallway of an office building you're just visiting. I can think of at least 2 ways to name each of the individual offices:

1. **Represent each object individually**: We can use what we've already learned and give each office an individual variable name.
1. **Represent each object as a part of another object**: We could choose to represent all offices of a floor as a batch (single collection), and reference each object by its position on the floor.

Let's compare these approaches:
<center>
<img src="https://github.com/vanderbilt-data-science/p4ai-essentials/blob/main/img/list_type_comparison.png?raw=true" width="800">
</center>

How do we do this in code? You'll have to take my word for it to start off with, but Python offers great functionality related to the list data structure as opposed to manipulating single elements. Let's check it out.

### Creating Lists
Below, we will learn the _syntax_ and _language_ we need to speak with Python so that it understands the tasks to complete.

In [None]:
# How do we make a list?
floor0 = ['judge chambers', 'jury entrance', 'media entrance', 'general entrance', 'gender-neutral restroom']

### Retrieving elements of lists
Coming back to our example...

In [None]:
# How do we reference elements in that list?


In [None]:
# How can we iterate over all values?
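
As a minimal sketch of what the two cells above might look like once filled in, here is the `floor0` list again with a few ways to reference its elements and a simple loop over all of them. Positions start at 0, and negative positions count from the end.

```python
floor0 = ['judge chambers', 'jury entrance', 'media entrance',
          'general entrance', 'gender-neutral restroom']

# Reference a single element by its position (index)
first_office = floor0[0]    # the first element
last_office = floor0[-1]    # the last element

# Reference several neighboring elements at once with a slice
first_two = floor0[:2]      # positions 0 and 1

print(first_office)
print(last_office)
print(first_two)

# Iterate over all values, one at a time
for office in floor0:
    print(office)
```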


### Iteration helpers you'll see frequently
We'll go over this in more detail in a later class, but the following functions are helpful, particularly in iteration:
* `enumerate`: returns both the index (position) in the list and the list value
* `len`: returns the length of the list
* `range`: returns a sequence of numbers (e.g., 1-4)

Make sure to take a look at these in depth tonight for your homework.
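
To get a feel for the three helpers before using them on the email body, here is a small sketch on the `floor0` list. Wrapping `range` and `enumerate` in `list(...)` is only done here so that we can see all of their values at once.

```python
floor0 = ['judge chambers', 'jury entrance', 'media entrance',
          'general entrance', 'gender-neutral restroom']

# len: how many elements are in the list
number_of_offices = len(floor0)

# range: a sequence of numbers; the stop value itself is not included
one_to_four = list(range(1, 5))
positions = list(range(len(floor0)))

# enumerate: pairs of (position, value); start=1 begins counting at 1
numbered = list(enumerate(floor0, start=1))

print(number_of_offices)
print(one_to_four)
print(positions)
print(numbered[0])
```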

In [None]:
# What else is something nice that we can do with that list data structure?
for position, line in enumerate(body):
  print(position, line)

In [None]:
# What else can we do with that list data structure?
for position in range(len(body)):
  print(position, body[position])

We can also change the definition of an individual element by its index.

One thing to remember about lists is that lists are **ordered**, that is, they are in a specific sequence and are ALWAYS in that sequence unless you change it.
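
A short sketch of changing one element by its index, again using the `floor0` list. Only the element at that position changes; everything else stays in the same sequence.

```python
floor0 = ['judge chambers', 'jury entrance', 'media entrance',
          'general entrance', 'gender-neutral restroom']

# Replace the element at position 2 with a new value
floor0[2] = 'janitor closet'

print(floor0)
```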

## Dictionaries
Think of dictionaries as exactly what the word implies - a dictionary. How do you use a dictionary?

Dictionaries are a set of `keys` (the words) and `values` (the definitions) organized as `key`-`value` pairs. You use a word in a dictionary to look up its value.

Let's return to our hallway example.
<center>
<img src="https://github.com/vanderbilt-data-science/p4ai-essentials/blob/main/img/dictionary_elements.png?raw=true" width="800">
</center>

We know that although a list structure _can_ represent this data, the ordered list isn't actually the _best_ representation we can get. Intuitively, we think of a _room number_ as identifying what is in the room. We can use a dictionary to help us with this. Let's check that out.

### Creating dictionaries

In [None]:
# What is the syntax for creating a dictionary?
floor0 = {'001':'judge chambers',
          '002':'jury entrance',
          '003':'janitor closet',
          '004':'general entrance',
          '005':'gender-neutral restroom'}
floor0

{'001': 'judge chambers',
 '002': 'jury entrance',
 '003': 'janitor closet',
 '004': 'general entrance',
 '005': 'gender-neutral restroom'}

In [None]:
#@markdown We can also use other syntax to make dictionaries

# Using dict function
floor0 = dict({'001':'judge chambers',
              '002':'jury entrance',
              '003':'janitor closet',
              '004':'general entrance',
              '005':'gender-neutral restroom'})

#Using a list of tuples, i.e., a list of pairs of elements
floor0 = dict([('001', 'judge chambers'),
               ('002', 'jury entrance'),
               ('003', 'janitor closet'),
               ('004', 'general entrance'),
               ('005', 'gender-neutral restroom')])

floor0

{'001': 'judge chambers',
 '002': 'jury entrance',
 '003': 'janitor closet',
 '004': 'general entrance',
 '005': 'gender-neutral restroom'}

### Retrieving elements of dictionaries

In [None]:
header

In [None]:
# How do we use our dictionaries?


# What if we want to know what the organization is?
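
A minimal sketch of how this cell might be filled in. To keep it runnable on its own, it uses a small stand-in `header` dictionary with a few values copied from the example email; the `Keywords` field is a hypothetical key chosen because it is absent.

```python
# Stand-in for the header dictionary returned by process_document
header = {'Subject': "INFO NEEDED: Gaucher's Disease",
          'Organization': "Matrix Artists' Network",
          'Lines': '33'}

# Look up a value by placing its key in square brackets
subject = header['Subject']
organization = header['Organization']

# .get returns a fallback instead of raising an error when a key is missing
keywords = header.get('Keywords', 'not present')

print(subject)
print(organization)
print(keywords)
```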


### Built-in Functionality
> Iteration, Keys, and Values

In [None]:
# we can use the dictionary for simple iteration over keys
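
One way this cell might be filled in, sketched on a small stand-in `header` dictionary so that it runs on its own. Looping over a dictionary directly gives you its keys, and each key can then be used to look up its value.

```python
# Stand-in for the header dictionary returned by process_document
header = {'Newsgroups': 'sci.med',
          'From': 'bed@intacc.uucp (Deb Waddington)',
          'Lines': '33'}

# Looping over the dictionary itself yields each key in turn
for heading in header:
    print(heading, '->', header[heading])

# The keys function gives the same keys explicitly
headings = list(header.keys())
print(headings)
```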



In [None]:
# We can also iterate over the values by using the built-in values function for dictionaries
for heading_value in header.values():
  print(heading_value)

sci.med
cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!darwin.sura.net!sgiblab!sdd.hp.com!decwrl!decwrl!uunet!utcsri!utnut!utzoo!telly!problem!intacc!bed
bed@intacc.uucp (Deb Waddington)
INFO NEEDED: Gaucher's Disease
<1993Mar18.002149.1111@intacc.uucp>
Thu, 18 Mar 1993 00:21:49 GMT
Everywhere
01 Jun 93
bed@intacc.UUCP (Deb Waddington)
Matrix Artists' Network
33


In [None]:
# We can also use the built in items function to retrieve key, value pairs
for heading, heading_value in header.items():
  print(heading, heading_value)

Newsgroups sci.med
Path cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!darwin.sura.net!sgiblab!sdd.hp.com!decwrl!decwrl!uunet!utcsri!utnut!utzoo!telly!problem!intacc!bed
From bed@intacc.uucp (Deb Waddington)
Subject INFO NEEDED: Gaucher's Disease
Message-ID <1993Mar18.002149.1111@intacc.uucp>
Date Thu, 18 Mar 1993 00:21:49 GMT
Distribution Everywhere
Expires 01 Jun 93
Reply-To bed@intacc.UUCP (Deb Waddington)
Organization Matrix Artists' Network
Lines 33


One thing to remember about dictionaries is that you look values up by **key**, not by position. Since Python 3.7, dictionaries do preserve insertion order, which is why the iteration above returns the results in the order that the keys were added. Even so, **order in a dictionary should not be necessary for code to behave appropriately.** If the order of key-value pairs is itself meaningful to your program, there are data structures that make this explicit, such as `collections.OrderedDict`.
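
A small sketch of what "order should not matter" means in practice, using two rooms from the hallway example. Two plain dictionaries with the same key-value pairs are considered equal no matter how they were written down, while `collections.OrderedDict` (from the standard library) treats order as part of the comparison.

```python
from collections import OrderedDict

# The same pairs, written in a different order
a = {'001': 'judge chambers', '002': 'jury entrance'}
b = {'002': 'jury entrance', '001': 'judge chambers'}

# Plain dictionaries compare by their key-value pairs only
same_plain = (a == b)

# OrderedDict also takes the order of the pairs into account
ordered_a = OrderedDict([('001', 'judge chambers'), ('002', 'jury entrance')])
ordered_b = OrderedDict([('002', 'jury entrance'), ('001', 'judge chambers')])
same_ordered = (ordered_a == ordered_b)

print(same_plain)
print(same_ordered)
```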

# Breakout Room 3: Interactive Prompt (10 minutes)

This is a packed day, so in case we don't get to it, please do this task for homework.

In this breakout room, you'll use an interactive prompt to check your understanding of the code, using ChatGPT as a helper. An interactive prompt means that you'll have ChatGPT to check your answers as you explain the code to it.

You can use a prompt similar to the following:

_I want to explain the above code to you, and I want you to check to make sure that my answers are correct. If they are wrong, explain to me why they are wrong. Do not give me any hints or more information about any section of the code until I tell you about it. When I say that I'm done explaining, please let me know if there are any important concepts that I have missed while explaining._

In your breakout room document:
1. Write the interactive prompt that you used
2. Write your explanations
3. Write anything ChatGPT has replied that you think may be incorrect or misleading.



# Homework

## Exploring Further
1. If this isn't making sense to you, try the interactive prompt or a version of it so that ChatGPT can help answer any questions that you have about the code's behavior. If it's still unclear, please bring these questions to class.
2. Take a look at some of the linked documents for variables and built-in functions for Python.

## Try it Yourself
1. Consider a code-based task that you want to execute. Ask ChatGPT to write the code for you.
2. Run the code using Google Colab.
3. Repeat our Breakout Room 2 exercise, identifying the individual Pythonic elements of the code, its overall functionality, and functionality of smaller code blocks.
4. Use the interactive prompt to explain the code back to ChatGPT in your own words.

## Preparation for next class
1. Today, we didn't cover functions and conditional execution. Use ChatGPT to provide a general explanation of conditional execution for the example code today and how the function works.
2. Make sure to bring any errors you find in ChatGPT's statements to class for discussion.