<a href="https://colab.research.google.com/github/vanderbilt-data-science/ai_summer/blob/main/0_2_py4genai_tests_apis_solns.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tests, Conditional Execution, Iteration, and Practice
> Rounding out our Python knowledge for success with langchain

Welcome to the last day of our AI-Assisted Programming crash course! In today's lesson, we'll cover conditional execution, iteration, and tests, and get some practice with these concepts. We'll continue learning the syntax and grammar of the Python language to effectively communicate our goals to Python.

In this lesson, you'll learn:
* How to communicate conditional execution to Python
* Different ways of communicating iteration to Python
* Testing: how to make sure your code does what it should do
* How to use APIs when genAI fails you!

Let's get started!


# Conditional Execution
Let's start by revisiting our email example. Recall that our function was:

In [None]:
# process document function
def process_document(document_string):
    lines = document_string.split('\n')
    
    header_dict = {}
    email_body = []
    
    # Flag to know when the header ends and the body starts
    body_flag = False
    
    for line in lines:
        if not body_flag:
            # Header lines processing
            if ':' in line:
                key, value = line.split(':', 1)
                header_dict[key.strip()] = value.strip()

                if 'Lines' in key:
                    body_flag = True
            else:
                # Header lines without ':' are considered part of the last field
                header_dict[key] += '\n' + line.strip()
                
        else:
            # Body lines processing
            email_body.append(line)
            
    return header_dict, email_body

We used this function by executing something similar to:
```
# Read in the document
with open('document.txt', 'r') as file:
    document_string = file.read()

# Process the document
header, body = process_document(document_string)

print(header)
print(body)
```

This already has an example of conditional execution, in which a variable is only modified if conditions are met. Let's explore this in greater depth.

## Writing conditional execution code
What we've written above is built from one or more `if`, `if-else`, and `if-elif-else` statements. The syntax for communicating conditional execution looks like so:

### `if` statements
```
if condition:
  #code block for if condition true
```

### `if-else` statements
`if-else` statements allow binary, mutually exclusive decisions:

```
if condition:
  #code block for if condition true
else:
  #code block for if condition false
```

### `if-elif-else` statements

`if-elif-else` statements allow multiple, mutually exclusive decisions:
```
if condition A:
  #code block for if condition A true
elif condition B:
  #code block for if condition B true
elif condition C:
  #code block for if condition C true
else:
  #code block for none of the above conditions are true
```

Let's see what this looks like for our code.
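
As a minimal sketch (the `classify_line` helper below is hypothetical and not part of the lesson's code), here is how an `if-elif-else` chain could sort a single email line into one of three mutually exclusive categories:

```python
def classify_line(line):
    """Label a line as 'blank', 'header', or 'body' (illustrative rules only)."""
    if line.strip() == '':
        # Condition A: nothing but whitespace
        return 'blank'
    elif ':' in line:
        # Condition B: only checked when condition A was false
        return 'header'
    else:
        # None of the above conditions were true
        return 'body'

for example in ['From: John Doe', '', 'Dear Jane,']:
    print(repr(example), '->', classify_line(example))
```

Because the branches are checked in order, a blank line never reaches the `':' in line` check, and exactly one branch runs for any input.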

# Tests and Asserts
Tests are code that you write to ensure your code is:
1. Semantically correct
2. Robust to all failure modes (e.g., explore all branches of if statements, etc)

This can be conceived of as formulating **test cases.** How can we do this for our `process_document` function? We can decide on some tests for point (1) by thinking about what we need the code to do.
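
The tests in this section rely on Python's `assert` statement. As a quick sketch of how it behaves (the `add` function is a hypothetical stand-in, not part of the lesson's code): `assert condition, message` does nothing when the condition is true, and raises an `AssertionError` carrying the message when it is false.

```python
def add(a, b):
    return a + b

# Passes silently because the condition is True
assert add(2, 3) == 5, 'add(2, 3) should equal 5'

# A failing assert raises AssertionError; we catch it here only to show the message
try:
    assert add(2, 3) == 6, 'add(2, 3) should equal 6'
except AssertionError as err:
    failure_message = str(err)

print(failure_message)
```

In a notebook you would normally not catch the error: letting the cell fail is what tells you the test did not pass.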

## Example Test 0: Basic Structure
> Make sure the inputs and outputs are as expected

* Inputs: string
* Output: dictionary, string (2)

In [None]:
# Formulate test case
input_email = """From: John Doe
To: Jane Doe
Organizer-Names: Friends and Family
Message ID: 253421-2
Reply-To: 
Lines: 3

Dear Jane,
The cat is out of the bag.
John
"""

print(input_email)

From: John Doe
To: Jane Doe
Organizer-Names: Friends and Family
Message ID: 253421-2
Reply-To: 
Lines: 3

Dear Jane,
The cat is out of the bag.
John



In [None]:
# Call the function
processed_email = process_document(input_email)
processed_email

({'From': 'John Doe',
  'To': 'Jane Doe',
  'Organizer-Names': 'Friends and Family',
  'Message ID': '253421-2',
  'Reply-To': '',
  'Lines': '3'},
 ['', 'Dear Jane,', 'The cat is out of the bag.', 'John', ''])

In [None]:
# Verify the behavior
def test_proc_doc_io(fn_outputs):
  assert len(fn_outputs)==2, 'The function did not return the correct number of elements'
  assert isinstance(fn_outputs[0], dict), 'The function did not return the first element as a dictionary'
  assert isinstance(fn_outputs[1], str), 'The function did not return the second element as a string.'

test_proc_doc_io(processed_email)

AssertionError: ignored

## Example Test 1: Semantics
> Making sure input and output values are correct

* We will use the same `input_email` from before
* We will use the same `processed_email` from before

Both are repeated here for clarity.

In [None]:
# Original email
input_email = """From: John Doe
To: Jane Doe
Organizer-Names: Friends and Family
Message ID: 253421-2
Reply-To: 
Lines: 3

Dear Jane,
The cat is out of the bag.
John
"""

print(input_email)

From: John Doe
To: Jane Doe
Organizer-Names: Friends and Family
Message ID: 253421-2
Reply-To: 
Lines: 3

Dear Jane,
The cat is out of the bag.
John



In [None]:
# Formulate the expected outputs
true_out_header = {'From': 'John Doe',
  'To': 'Jane Doe',
  'Organizer-Names': 'Friends and Family',
  'Message ID': '253421-2',
  'Reply-To': '',
  'Lines': '3'}

true_out_body = """
Dear Jane,
The cat is out of the bag.
John
"""

In [None]:
# Call the function
processed_email = process_document(input_email)
processed_email

({'From': 'John Doe',
  'To': 'Jane Doe',
  'Organizer-Names': 'Friends and Family',
  'Message ID': '253421-2',
  'Reply-To': '',
  'Lines': '3'},
 ['', 'Dear Jane,', 'The cat is out of the bag.', 'John', ''])

In [None]:
#separate the components of the output tuple
proc_out_header, proc_out_body = processed_email

assert proc_out_header == true_out_header, 'The header of the function output is incorrect.'
assert proc_out_body == true_out_body, 'The body of the function output is incorrect.'

AssertionError: ignored

## Example Test 2: Robustness
> What if the email has no body and is only a header?



In [None]:
# Formulate test case
input_email = """From: John Doe
To: Jane Doe
Organizer-Names: Friends and Family
Message ID: 253421-2
Reply-To: 
Lines: 3"""

In [None]:
# Call the function
out_header, out_body = process_document(input_email)

In [None]:
# What should be true about this, then?
assert out_header, 'The header should not be empty for an email which has a header'
assert not out_body, 'The list should be empty for an email with no body'

## Breakout Session 1 (10 mins)
In this breakout session, you will continue the process of generating and writing tests/test cases for the provided code. Your assignments are as follows:

1. Write a test that ensures that the code can run on the standard header provided in one of the emails.
2. Write a test that ensures that the code can run with no header and only a message body.
3. Generate your own test to test the robustness of the function (either you can generate this or experiment by asking your GenAI).
4. (If time) Fix the code for where the asserts fail.

In [None]:
# Solution 1
standard_header_email = """Newsgroups: sci.med
Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!darwin.sura.net!sgiblab!sdd.hp.com!decwrl!decwrl!uunet!utcsri!utnut!utzoo!telly!problem!intacc!bed
From: bed@intacc.uucp (Deb Waddington)
Subject: INFO NEEDED: Gaucher's Disease
Message-ID: <1993Mar18.002149.1111@intacc.uucp>
Date: Thu, 18 Mar 1993 00:21:49 GMT
Distribution: Everywhere
Expires: 01 Jun 93
Reply-To: bed@intacc.UUCP (Deb Waddington)
Organization: Matrix Artists' Network
Lines: 33


Dear Cat,

Please stop eating my food. You have your own. It doesn't matter that I eat yours.

Sincerely,

Dog

"""

true_standard_header = {'Newsgroups': 'sci.med',
 'Path': 'cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!darwin.sura.net!sgiblab!sdd.hp.com!decwrl!decwrl!uunet!utcsri!utnut!utzoo!telly!problem!intacc!bed',
 'From': 'bed@intacc.uucp (Deb Waddington)',
 'Subject': "INFO NEEDED: Gaucher's Disease",
 'Message-ID': '<1993Mar18.002149.1111@intacc.uucp>',
 'Date': 'Thu, 18 Mar 1993 00:21:49 GMT',
 'Distribution': 'Everywhere',
 'Expires': '01 Jun 93',
 'Reply-To': 'bed@intacc.UUCP (Deb Waddington)',
 'Organization': "Matrix Artists' Network",
 'Lines': '33'}

true_standard_body = """

Dear Cat,

Please stop eating my food. You have your own. It doesn't matter that I eat yours.

Sincerely,

Dog

"""

standard_header, standard_body = process_document(standard_header_email)
assert true_standard_header == standard_header, "Headers do not match"
assert true_standard_body == standard_body, "Bodies don't match"


AssertionError: ignored

In [None]:
# Solution 2
no_header_email = """Dear Cat,

Please stop eating my food. You have your own. It doesn't matter that I eat yours.

Sincerely,

Dog

"""

no_header_header, no_header_body = process_document(no_header_email)
assert not no_header_header, "The header should be empty since there isn't one"
assert no_header_body, "The body should exist and not be empty"

In [None]:
# Possible solution 3 : Check that none of the keys or values of the header have whitespace
assert not [1 for key, val in standard_header.items() if str(key).strip()!=str(key) or str(val).strip()!=str(val)], "Whitespace found in key or value of header"


In [None]:
# Possible solution for 4 - Thanks, GPT4!
def process_document(document_string):
    lines = document_string.splitlines(True)
    
    header_dict = {}
    email_body = []
    
    # Flag to know when the header ends and the body starts
    body_flag = False
    
    # Keep track of the last header key
    last_key = None

    for line in lines:
        if not body_flag:
            if ':' in line:
                key, value = line.split(':', 1)
                header_dict[key.strip()] = value.strip()
                last_key = key.strip()

                if 'Lines' in key:
                    body_flag = True
            else:
                # If a header was previously defined, continue appending the content
                if last_key:
                    header_dict[last_key] += '\n' + line.strip()
                else:
                    # If no header was previously defined, it's the start of the body
                    body_flag = True
                    email_body.append(line)
                
        else:
            # Body lines processing
            email_body.append(line)
            
    return header_dict, ''.join(email_body)


**Run most of the original tests, making sure to run the problematic ones.**

In [None]:
# Run tests
input_email = """From: John Doe
To: Jane Doe
Organizer-Names: Friends and Family
Message ID: 253421-2
Reply-To: 
Lines: 3

Dear Jane,
The cat is out of the bag.
John
"""

# Formulate the expected outputs
true_out_header = {'From': 'John Doe',
  'To': 'Jane Doe',
  'Organizer-Names': 'Friends and Family',
  'Message ID': '253421-2',
  'Reply-To': '',
  'Lines': '3'}

true_out_body = """
Dear Jane,
The cat is out of the bag.
John
"""

# Call the function
processed_email = process_document(input_email)
processed_email

# Basic structure
assert len(processed_email)==2, 'The function did not return the correct number of elements'
assert isinstance(processed_email[0], dict), 'The function did not return the first element as a dictionary'
assert isinstance(processed_email[1], str), 'The function did not return the second element as a string.'

# Semantics 1
assert processed_email[0] == true_out_header, 'The header of the function output is incorrect.'
assert processed_email[1] == true_out_body, 'The body of the function output is incorrect.'

**If important, make sure we don't error out on emails with no header**

In [None]:
no_header_email = """Dear Cat,

Please stop eating my food. You have your own. It doesn't matter that I eat yours.

Sincerely,

Dog

"""

no_header_header, no_header_body = process_document(no_header_email)
assert not no_header_header, "The header should be empty since there isn't one"
assert no_header_body, "The body should exist and not be empty"

**Yay!**

# Iteration
We've already seen some examples of iteration, where we need to cycle through a collection data structure to apply statements to one or more of the elements of the collection (e.g. dictionaries, lists).

There are 2 primary ways that you see iteration in Python:
* `for` loops
* `comprehensions` (list, dictionary, generator)

A final type that we will see _at length_ with Huggingface is:
* `map`

Let's explore this.
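
Since `map` has not appeared yet, here is a minimal sketch of Python's built-in `map` function on two made-up subject lines. The Hugging Face usage mentioned above is assumed to be the `map` method on datasets, which follows the same idea of applying one function to every element.

```python
subjects = ['Re: Emphysema question', 'Cancer of the testis']

# map applies the function to each element lazily; list() collects the results
subject_lengths = list(map(len, subjects))
upper_subjects = list(map(str.upper, subjects))

print(subject_lengths)
print(upper_subjects)
```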

## `for` loops
We've already seen some examples of `for` loops when we were learning about lists and dictionaries.

We said that our `for` goes through cyclical iterations, updating the index to process each element.

Our syntax was as follows:
```
for dummy_name in collection:
  ## indented code block steps to take
```

Let's explore this.
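
As a short refresher, and a sketch only, the loops below walk over a small hand-written header dictionary (modeled on the test email above) and apply the same statements to each element:

```python
header = {'From': 'John Doe', 'To': 'Jane Doe', 'Lines': '3'}

# Iterating over .items() yields one (key, value) pair per cycle
summary_lines = []
for field_name, field_value in header.items():
    summary_lines.append(f'{field_name} -> {field_value}')

# enumerate adds a running index when the position matters
for index, summary in enumerate(summary_lines):
    print(index, summary)
```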

### Finishing our email processing and exploration
Although we've read in one email, recall that we wanted to process all of the documents in the folder (we'll do a subset here for the sake of time). Let's try this using the syntax that we learned for iteration.

In [None]:
import glob

In [None]:
# download the data using command line tools
!curl -Os http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz

In [None]:
# Untar the data to the current directory
# note that we can use %%capture if we don't want to see all of the output
!tar -zxf 20news-19997.tar.gz

In [None]:
#use glob to enumerate files
email_paths_list = glob.glob('20_newsgroups/sci.med/*')

#print length
print('Number of emails:', len(email_paths_list))

# Print first 10
email_paths_list[:10]

Number of emails: 1000


['20_newsgroups/sci.med/58146',
 '20_newsgroups/sci.med/58867',
 '20_newsgroups/sci.med/58068',
 '20_newsgroups/sci.med/59223',
 '20_newsgroups/sci.med/59501',
 '20_newsgroups/sci.med/58763',
 '20_newsgroups/sci.med/59467',
 '20_newsgroups/sci.med/58089',
 '20_newsgroups/sci.med/58986',
 '20_newsgroups/sci.med/58800']

In [None]:
#write for loop to iterate through elements and do the task
max_emails = 20
all_headers = []
all_bodies = []
all_emails = []

for email_path in email_paths_list[:max_emails]:
  with open(email_path, 'r') as f:
    this_email = f.read()
  
  #add the email
  all_emails.append(this_email)

  #process the email
  email_header, email_body = process_document(this_email)

  #save for later
  all_headers.append(email_header)
  all_bodies.append(email_body)

In [None]:
#show email
print(all_emails[19])

#show header
display('\nPrinting Header......', all_headers[19], '\n')

#show body
print('\nPrinting body.........', all_bodies[19])

Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!noc.near.net!uunet!portal!cup.portal.com!amigan
From: amigan@cup.portal.com (Mike - Medwid)
Newsgroups: sci.med
Subject: Re: Emphysema question
Message-ID: <79723@cup.portal.com>
Date: Sat, 17 Apr 93 09:36:19 PDT
Organization: The Portal System (TM)
Distribution: na
References: <8944@blue.cis.pitt.edu>
  <1993Apr15.180621.29465@radford.vak12ed.edu> <9072@blue.cis.pitt.edu>
Lines: 11

Thanks to all who replied to my initial question.  I've been away in 
New Jersey all week and was surprised to see all the responses
when I got back.  

To the person asking about nicotine patches, there are four on the
market:

Habitrol - Ciba Pharmaceuticals
Nicoderm - Marion Merill Dow (Alza made)
Nicotrol - Warner Lambert (Cygnus made)
ProStep - Made by Elan and marketed by ??



'\nPrinting Header......'

{'Path': 'cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!noc.near.net!uunet!portal!cup.portal.com!amigan',
 'From': 'amigan@cup.portal.com (Mike - Medwid)',
 'Newsgroups': 'sci.med',
 'Subject': 'Re: Emphysema question',
 'Message-ID': '<79723@cup.portal.com>',
 'Date': 'Sat, 17 Apr 93 09:36:19 PDT',
 'Organization': 'The Portal System (TM)',
 'Distribution': 'na',
 'References': '<8944@blue.cis.pitt.edu>\n<1993Apr15.180621.29465@radford.vak12ed.edu> <9072@blue.cis.pitt.edu>',
 'Lines': '11'}

'\n'


Printing body......... 
Thanks to all who replied to my initial question.  I've been away in 
New Jersey all week and was surprised to see all the responses
when I got back.  

To the person asking about nicotine patches, there are four on the
market:

Habitrol - Ciba Pharmaceuticals
Nicoderm - Marion Merill Dow (Alza made)
Nicotrol - Warner Lambert (Cygnus made)
ProStep - Made by Elan and marketed by ??



## List Comprehensions
Another very compact way to represent for loops is through list comprehensions. They're great if you:
* Essentially have one function to apply to a list of elements
* Want to do binary conditional execution on elements of a list
* Want to perform filtering of elements (reduce the size of the list based on some condition)

The main difficulty with list and dictionary comprehensions is their syntax, which packs a whole `for` loop into a single expression. In return, they streamline the creation of new lists based on old lists. Let's look at a brief comparison.

<center>
<img src="https://github.com/vanderbilt-data-science/p4ai-essentials/blob/main/img/iteration_comparison.png?raw=true" width="800">
</center>
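
To make the three use cases listed above concrete, here is a minimal sketch on a made-up list of `Lines` header values (the variable names are illustrative):

```python
line_counts = ['33', '11', '21', '42']

# 1. Apply one function to every element
as_ints = [int(count) for count in line_counts]

# 2. Binary conditional on each element (the if-else goes before the for)
size_labels = ['long' if n > 20 else 'short' for n in as_ints]

# 3. Filtering (the if goes after the for and can shrink the list)
long_only = [n for n in as_ints if n > 20]

print(as_ints, size_labels, long_only)
```

Note where the `if` sits: before the `for` it chooses between two values for every element, while after the `for` it decides whether an element is kept at all.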

### Converting part of our `for loop` to list comprehension
Yep, that's right: we're doing the exact same example again, except using list comprehensions. Don't think too hard about the outputs here; the goal is to gain experience with the syntax of list comprehensions rather than to produce any new functionality.

Recall that we had:
```
#write for loop to iterate through elements and do the task
max_emails = 20
all_headers = []
all_bodies = []
all_emails = []

for email_path in email_paths_list[:max_emails]:
  with open(email_path, 'r') as f:
    this_email = f.read()
  
  #add the email
  all_emails.append(this_email)

  #process the email
  email_header, email_body = process_document(this_email)

  #save for later
  all_headers.append(email_header)
  all_bodies.append(email_body)
```
We can re-express this as a list comprehension. We will start with a single portion, using the existing `all_emails` list. You'll investigate the use of list comprehensions on other components in a breakout room later.



In [None]:
# recall all_emails, where we'll show the first 2
all_emails[:2]

['Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!howland.reston.ans.net!noc.near.net!news.hi.com!duca.hi.com!not-for-mail\nFrom: wright@duca.hi.com (David Wright)\nNewsgroups: sci.med\nSubject: Re: Name of MD\'s eyepiece?\nDate: 6 Apr 1993 16:31:02 -0000\nOrganization: Hitachi Computer Products, OSSD division\nLines: 21\nMessage-ID: <1pspa6INNfik@duca.hi.com>\nReferences: <lr3uceINNc0r@news.bbn.com> <C4IHM2.Gs9@watson.ibm.com> <19387@pitt.UUCP>\nNNTP-Posting-Host: duca.hi.com\n\nIn article <19387@pitt.UUCP> geb@cs.pitt.edu (Gordon Banks) writes:\n>In article <C4IHM2.Gs9@watson.ibm.com> clarke@watson.ibm.com (Ed Clarke) writes:\n>>|> |It\'s not an eyepiece.  It is called a head mirror.  All doctors never\n>>\n>>A speculum?\n>\n>The speculum is the little cone that fits on the end of the otoscope.\n>There are also vaginal specula that females and gynecologists are\n>all too familiar with.\n\nIn fairness, we should note that if you look up "

In [None]:
# Choose to express as a list comprehension
processed_emails = [process_document(email_in) for email_in in all_emails[:2]]

In [None]:
# Get headers and bodies separately
all_headers, all_bodies = zip(*processed_emails)
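
The `zip(*processed_emails)` call can look cryptic at first. As a sketch on toy data: the `*` unpacks the list so that each `(header, body)` pair becomes a separate argument to `zip`, which then groups all first elements together and all second elements together. The results are tuples, not lists.

```python
pairs = [({'From': 'A'}, 'body a'), ({'From': 'B'}, 'body b')]

# Equivalent to zip(pairs[0], pairs[1]): regroups elements by position within each pair
headers, bodies = zip(*pairs)

print(headers)
print(bodies)
```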

In [None]:
# Check
display(all_headers)
print()
display(all_bodies)

({'Path': 'cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!howland.reston.ans.net!noc.near.net!news.hi.com!duca.hi.com!not-for-mail',
  'From': 'wright@duca.hi.com (David Wright)',
  'Newsgroups': 'sci.med',
  'Subject': "Re: Name of MD's eyepiece?",
  'Date': '6 Apr 1993 16:31:02 -0000',
  'Organization': 'Hitachi Computer Products, OSSD division',
  'Lines': '21'},
 {'Xref': 'cantaloupe.srv.cs.cmu.edu misc.consumers:67737 sci.med:58867 rec.food.cooking:65271 sci.environment:29473',
  'Path': 'cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!bb3.andrew.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu!zaphod.mps.ohio-state.edu!magnus.acs.ohio-state.edu!usenet.ins.cwru.edu!agate!zabriskie.berkeley.edu!spp',
  'From': 'spp@zabriskie.berkeley.edu (Steve Pope)',
  'Newsgroups': 'misc.consumers,sci.med,rec.food.cooking,sci.environment',
  'Subject': 'Re: Is MSG sensitivity superstition?',
  'Date': '17 Apr 1993 01:52:00 GMT',
  'Organization': 'U.C. Ber




('Message-ID: <1pspa6INNfik@duca.hi.com>\nReferences: <lr3uceINNc0r@news.bbn.com> <C4IHM2.Gs9@watson.ibm.com> <19387@pitt.UUCP>\nNNTP-Posting-Host: duca.hi.com\n\nIn article <19387@pitt.UUCP> geb@cs.pitt.edu (Gordon Banks) writes:\n>In article <C4IHM2.Gs9@watson.ibm.com> clarke@watson.ibm.com (Ed Clarke) writes:\n>>|> |It\'s not an eyepiece.  It is called a head mirror.  All doctors never\n>>\n>>A speculum?\n>\n>The speculum is the little cone that fits on the end of the otoscope.\n>There are also vaginal specula that females and gynecologists are\n>all too familiar with.\n\nIn fairness, we should note that if you look up "speculum" in the\ndictionary (which I did when this question first surfaced), the first\ndefinition is "a mirror or polished metal plate used as a reflector in\noptical instruments."\n\nWhich doesn\'t mean the name fits in this context, but it\'s not as far\noff as you might think.\n\n  -- David Wright, Hitachi Computer Products (America), Inc.  Waltham, MA\n     wri

## Breakout Session 2 (10 minutes)

In this breakout session, you'll re-express the code we had before as a set of list comprehensions. Try to think through each list comprehension yourself first and write it. If you run into issues, work with GenAI to determine where the error in your logic (or syntax) lies.

**In your breakout room:**
1. Write a function `read_doc` which will read a single email given a file path. It should return the email.
2. Create a list comprehension which reads all (20 for the sake of time) of the emails. It should return a list of all the emails (`all_emails`) as above. Recall that we have `email_paths_list`.
3. Leverage the list comprehension written above to process all the emails, and separate into `all_headers` and `all_bodies`, and condense 1-3 into a single cell.
4. Make sure your code runs, and ensure the output is as you expect.
5. Compare and contrast the approaches of the `for` loop and the list comprehension. What are the advantages/disadvantages of both?
6. (If time) Create a function which contains all of the functionality described above and the `glob` functionality. Your function should take as input the directory to be read and the number of emails that should be returned.

In [None]:
# Solution 1
def read_doc(filepath):
  with open(filepath, 'r') as f:
    this_email = f.read()
  
  return this_email

# Check
read_doc(email_paths_list[0])

'Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!howland.reston.ans.net!noc.near.net!news.hi.com!duca.hi.com!not-for-mail\nFrom: wright@duca.hi.com (David Wright)\nNewsgroups: sci.med\nSubject: Re: Name of MD\'s eyepiece?\nDate: 6 Apr 1993 16:31:02 -0000\nOrganization: Hitachi Computer Products, OSSD division\nLines: 21\nMessage-ID: <1pspa6INNfik@duca.hi.com>\nReferences: <lr3uceINNc0r@news.bbn.com> <C4IHM2.Gs9@watson.ibm.com> <19387@pitt.UUCP>\nNNTP-Posting-Host: duca.hi.com\n\nIn article <19387@pitt.UUCP> geb@cs.pitt.edu (Gordon Banks) writes:\n>In article <C4IHM2.Gs9@watson.ibm.com> clarke@watson.ibm.com (Ed Clarke) writes:\n>>|> |It\'s not an eyepiece.  It is called a head mirror.  All doctors never\n>>\n>>A speculum?\n>\n>The speculum is the little cone that fits on the end of the otoscope.\n>There are also vaginal specula that females and gynecologists are\n>all too familiar with.\n\nIn fairness, we should note that if you look up "s

In [None]:
# Solution 2
all_emails_bo = [read_doc(fpath) for fpath in email_paths_list[:20]]
len(all_emails_bo)
print(all_emails_bo[0], '\n')
print(all_emails_bo[19])

Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!howland.reston.ans.net!noc.near.net!news.hi.com!duca.hi.com!not-for-mail
From: wright@duca.hi.com (David Wright)
Newsgroups: sci.med
Subject: Re: Name of MD's eyepiece?
Date: 6 Apr 1993 16:31:02 -0000
Organization: Hitachi Computer Products, OSSD division
Lines: 21
Message-ID: <1pspa6INNfik@duca.hi.com>
References: <lr3uceINNc0r@news.bbn.com> <C4IHM2.Gs9@watson.ibm.com> <19387@pitt.UUCP>
NNTP-Posting-Host: duca.hi.com

In article <19387@pitt.UUCP> geb@cs.pitt.edu (Gordon Banks) writes:
>In article <C4IHM2.Gs9@watson.ibm.com> clarke@watson.ibm.com (Ed Clarke) writes:
>>|> |It's not an eyepiece.  It is called a head mirror.  All doctors never
>>
>>A speculum?
>
>The speculum is the little cone that fits on the end of the otoscope.
>There are also vaginal specula that females and gynecologists are
>all too familiar with.

In fairness, we should note that if you look up "speculum" in the
dictiona

In [None]:
# Solution 3-4
#read all emails
all_emails_bo = [read_doc(fpath) for fpath in email_paths_list[:20]]

#process all emails
processed_emails_bo = [process_document(email_in) for email_in in all_emails_bo]

#separate into component pieces
all_headers_bo, all_bodies_bo = zip(*processed_emails_bo)

In [None]:
#inspect
display(all_headers_bo[-3:-1])
print()
display(all_bodies_bo[-3:-1])

({'Path': 'cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!ogicse!emory!gatech!pitt.edu!pitt!km',
  'From': 'km@cs.pitt.edu (Ken Mitchum)',
  'Newsgroups': 'sci.med',
  'Subject': 'Re: Menangitis question',
  'Message-ID': '<19427@pitt.UUCP>',
  'Date': '6 Apr 93 14:58:33 GMT',
  'Article-I.D.': 'pitt.19427',
  'References': '<C4nzn6.Mzx@crdnns.crd.ge.com>',
  'Sender': 'news@cs.pitt.edu',
  'Reply-To': 'km@cs.pitt.edu (Ken Mitchum)',
  'Organization': 'Univ. of Pittsburgh Computer Science',
  'Lines': '42'},
 {'Newsgroups': 'sci.med',
  'Path': 'cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!noc.near.net!howland.reston.ans.net!zaphod.mps.ohio-state.edu!swrinde!network.ucsd.edu!munnari.oz.au!metro!ob1!hawkesbury.uws.EDU.AU!j.thornton',
  'From': 'j.thornton@hawkesbury.uws.EDU.AU (Jason Thornton       x640)',
  'Subject': 'Cancer of the testis',
  'Message-ID': '<j.thornton.15@hawkesbury.uws.EDU.AU>',
  'Sender': 'news@uws.EDU.AU',
  'Organization': 'University of Western Sydney, Hawkesb




('\nIn article <C4nzn6.Mzx@crdnns.crd.ge.com> brooksby@brigham.NoSubdomain.NoDomain (Glen W Brooksby) writes:\n>This past weekend a friend of mine lost his 13 month old\n>daughter in a matter of hours to a form of menangitis.  The\n>person informing me called it \'Nicereal Meningicocis\' (sp?).\n>In retrospect, the disease struck her probably sometime on \n>Friday evening and she passed away about 2:30pm on Saturday.\n>The symptoms seemed to be a rash that started small and\n>then began progressing rapidly. She began turning blue\n>eventually which was the tip-off that this was serious\n>but by that time it was too late (this is all second hand info.).\n>\n>My question is:\n>Is this an unusual form of Menangitis?  How is it transmitted?\n>How does it work (ie. how does it kill so quickly)?\n\nThere are many organisms, viral, bacterial, and fungal, which can\ncause meningitits, and the course of these infections varies\nwidely. The causes of bacterial meningitis vary with age: in adults

# Guided Exploration of HF API

Most of the time, what we see in API documentation looks a lot like functions. We see the "contract" components of the new functions that are available to us: the function name, the inputs required and available, and the outputs. Let's experiment with one!

In [None]:
%%capture
#install packages (%%capture is a cell magic, so it must be the first line of the cell)
!pip install transformers

In [None]:
#make package available to python kernel
from transformers import pipeline

In [None]:
# create a pipeline
mdl_pipe = pipeline('text-classification')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [None]:
#when in doubt, check the type
type(mdl_pipe)

transformers.pipelines.text_classification.TextClassificationPipeline

In [None]:
#Use the pipeline for inference
a_text = 'The dog was walking on a beautiful summer day.'
a_text_cls = mdl_pipe(a_text)
a_text_cls

[{'label': 'POSITIVE', 'score': 0.9998334646224976}]

## Guided Example 1
What if we want to return all scores?

In [None]:
#Return all scores
a_text_scores = mdl_pipe(a_text, return_all_scores = True)
a_text_scores



[[{'label': 'NEGATIVE', 'score': 0.00016651903570163995},
  {'label': 'POSITIVE', 'score': 0.9998334646224976}]]
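
Once all scores are returned, plain Python is enough to work with them. The sketch below copies the output shown above into a literal (so it runs without the model) and picks the highest-scoring label for each input with `max`:

```python
# Output copied from the cell above: one inner list per input text
a_text_scores = [[{'label': 'NEGATIVE', 'score': 0.00016651903570163995},
                  {'label': 'POSITIVE', 'score': 0.9998334646224976}]]

# For each input text, keep the dictionary with the largest score
best_per_text = [max(scores, key=lambda entry: entry['score']) for scores in a_text_scores]

print(best_per_text[0]['label'])
```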

## Guided Example 2
What if we had a list of texts, loaded, for example, from file?

In [None]:
#Using lists
texts = ['The dog was happy and ran along playfully.',
         'The cat glared at me, judging me from afar.',
         'The groundhog peeked its head above the ground.',
         'Opposums are criminally underrated.']
texts_scores = mdl_pipe(texts)
texts_scores

[{'label': 'POSITIVE', 'score': 0.9997424483299255},
 {'label': 'NEGATIVE', 'score': 0.8735092282295227},
 {'label': 'POSITIVE', 'score': 0.9485796093940735},
 {'label': 'NEGATIVE', 'score': 0.9712409377098083}]

## Guided Example 3
Maybe we don't like how this model is performing. Can we use a different model? How?

Reference: [Huggingface Models](https://huggingface.co/models)

In [None]:
#modify code to use different model
cardiff_pipe = pipeline('text-classification', model='cardiffnlp/twitter-roberta-base-sentiment')
cardiff_scores = cardiff_pipe(texts)


In [None]:
#see results
cardiff_scores

[{'label': 'LABEL_2', 'score': 0.9303200840950012},
 {'label': 'LABEL_0', 'score': 0.5754473805427551},
 {'label': 'LABEL_1', 'score': 0.8416589498519897},
 {'label': 'LABEL_0', 'score': 0.900627851486206}]
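
The generic `LABEL_0`-style names are hard to read. The mapping used below (0 is negative, 1 is neutral, 2 is positive) is an assumption based on how the model card for `cardiffnlp/twitter-roberta-base-sentiment` is usually read, so confirm it on the model page before relying on it. Under that assumption, a dictionary lookup inside a list comprehension makes the output more readable:

```python
# Assumed mapping; verify against the model card before relying on it
label_names = {'LABEL_0': 'negative', 'LABEL_1': 'neutral', 'LABEL_2': 'positive'}

# Output copied from the cell above so this runs without the model
cardiff_scores = [{'label': 'LABEL_2', 'score': 0.9303200840950012},
                  {'label': 'LABEL_0', 'score': 0.5754473805427551},
                  {'label': 'LABEL_1', 'score': 0.8416589498519897},
                  {'label': 'LABEL_0', 'score': 0.900627851486206}]

# Build new dictionaries with the readable label and the original score
readable_scores = [{'label': label_names[pred['label']], 'score': pred['score']}
                   for pred in cardiff_scores]

for pred in readable_scores:
    print(pred['label'], round(pred['score'], 3))
```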

# Breakout Session 3 (10 minutes): APIs
Now, try this on your own using the following APIs:
* [Langchain Documentation](https://python.langchain.com/en/latest/index.html)
* [HuggingFace Documentation](https://huggingface.co/docs/transformers/index)
* [OpenAI Documentation](https://platform.openai.com/docs/api-reference)

Using one of the quickstart pages or tutorials, answer the following questions:
1. What is the package being used?
2. What is being imported? Does it appear to be imported from a module? Which module and how can you tell?
3. Try to run the quickstart code to try out the example.
4. Summarize what you think the code is doing. Otherwise, use a GenAI to help summarize the behavior.
5. Try to implement something of interest, particularly something that has been released relatively recently. For example, try to create a Huggingface Agent using just the quickstart and API. Then, modify this code so that the models aren't downloaded onto your computer and are instead run through the Inference API.

# Congratulations!
You made it through the first crash course with Python for using HuggingFace! You now:

1. Know how to use Google Colab
2. Have built some intuition around what it is to program and that programming is just another language with syntax, semantics, and grammar
3. Know several standard Python data types and how to use them
4. Know several standard Python data structures and how to use them
5. Have learned how to use functions, what they expect, and what they return
6. Know how packages, libraries, modules, classes, functions, and methods all relate and how you can leverage this information to help understand APIs
7. Understand APIs as contracts about what is expected to be input and what should be returned
8. Have learned how to communicate conditional execution
9. Have learned about standard types of iteration in Python

That is A LOT to cover in 3 days - and I'm proud of you for sticking with it!

Next week, we'll delve into LangChain, and we'll use this Python knowledge to help us understand tutorials and grow on our own in looking at the APIs and documentation LangChain, OpenAI, and Huggingface provide.