# Homework 9 Solutions

*Enter your name and EID here*

**This homework is due on April 16, 2019 at 4:00pm. Please submit as a PDF file on Canvas. Before submission, please re-run all cells by clicking "Kernel" and selecting "Restart & Run All."**


**Problem 1 (2 points):** Using Biopython and the Pubmed database, calculate the average number of papers per year that Dr. Wilke has published from 2015-2019 (inclusive, so that's 5 years total). 

**Hints**: Dr. Wilke will always appear as "Wilke CO" in the Pubmed database. Also, make sure to set the `retmax` argument to at least `50` in `Entrez.esearch()` so that you retrieve all of the papers. 

In [1]:
# You will need Entrez and Medline to solve this problem
from Bio import Entrez, Medline

Entrez.email = "dariya.k.sydykova@gmail.com"

handle = Entrez.esearch(db="pubmed",  # database to search
                        term="Wilke CO[Author] AND 2015[Date - Publication]:2019[Date - Publication]",  # search term
                        retmax=50 # Maximum number of results to return
                        )
record = Entrez.read(handle)
handle.close()

# search returns PubMed IDs (pmids)
pmid_list = record["IdList"]

# Average number of papers per year over the 5 years 2015-2019 (inclusive)
average = len(pmid_list) / 5

print(average)

8.8
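
The average above is only correct if `retmax` was large enough to return every matching paper. As a minimal sketch, the check below compares the `Count` field that `Entrez.read()` reports for an `esearch` result against the number of IDs actually returned. The helper name `summarize_search` and the sample record are made up for illustration; a real record would come from `Entrez.read(handle)`.

```python
def summarize_search(record, first_year, last_year):
    """Return (papers per year, truncated?) for an esearch-style record."""
    n_years = last_year - first_year + 1   # inclusive range, e.g. 2015-2019 is 5 years
    total = int(record["Count"])           # total number of hits reported by the server
    returned = len(record["IdList"])       # IDs actually returned, capped by retmax
    truncated = returned < total           # True means retmax was too small
    return total / n_years, truncated


# Hypothetical record for illustration only (not real PubMed IDs)
fake_record = {"Count": "44", "IdList": [str(i) for i in range(44)]}

average, truncated = summarize_search(fake_record, 2015, 2019)
print(average, truncated)  # prints: 8.8 False
```

If `truncated` comes back `True`, rerunning the search with a larger `retmax` should retrieve the remaining IDs.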


**Problem 2 (4 points):** From the years 2015-2019 (inclusive), how many different co-authors did Dr. Wilke publish with and how many times did Dr. Wilke publish a paper with each co-author? Print out each co-author and the number of times Dr. Wilke published a paper with that co-author. Make sure you don't print the same co-author's name twice.

**Hint**: In class 21, we parsed the results of a literature search with `Medline.parse()`. This allows us to look at the references we found and to retrieve different parts of the reference with a key. For example, to retrieve the abstract, we would write `record['AB']`. You can find a list of possible keys [here](https://www.nlm.nih.gov/bsd/mms/medlineelements.html).

In [2]:
# Fetch the Medline records for the PubMed IDs found in Problem 1
handle = Entrez.efetch(db="pubmed", id=pmid_list, rettype="medline", retmode="text")
records = Medline.parse(handle)

# Create an empty dictionary to keep author names and counts
author_dict = {}
for record in records:
     
    # retrieve author names; a record without an "AU" field yields an empty list
    author_lst = record.get("AU", [])
    
    for author in author_lst:
        
        # skip Dr. Wilke's name
        if author == "Wilke CO":
            continue
        
        # check if author is in the dictionary
        if author in author_dict:
            author_dict[author] += 1 # increment the count of author by 1
        else:
            author_dict[author] = 1 # set the count of author to 1

# Close the efetch handle    
handle.close()

# print each co-author and the number of shared papers
print("Dr. Wilke's co-authors are:")
for author in author_dict:
    print(" ", author + ":", str(author_dict[author]) + "x")

Dr. Wilke's co-authors are:
  Johnson MM: 2x
  Houser JR: 2x
  Barclay W: 1x
  Georgiou G: 1x
  Lloyd-Smith JO: 1x
  Kasson PM: 1x
  Lungu OI: 1x
  Knauf GA: 1x
  Krug RM: 1x
  Arnold JJ: 1x
  Herfst S: 1x
  Kerr SA: 1x
  B Kc D: 1x
  Russell CA: 1x
  Lambowitz AM: 1x
  Jack BR: 6x
  Cameron CE: 1x
  Lenoir WF: 1x
  Sawyer SL: 2x
  Needham BD: 1x
  Jewett MC: 1x
  Kachroo AH: 3x
  Carroll SM: 2x
  Belser JA: 1x
  Cunningham AL: 1x
  Papoulas O: 1x
  Ho KS: 1x
  Laurent JM: 3x
  Wu DC: 1x
  Barrick JE: 3x
  Cobey S: 1x
  Dasgupta A: 2x
  Derryberry DZ: 1x
  Smith BL: 5x
  Demogines A: 1x
  Marcotte EM: 6x
  Chapman SD: 1x
  Paff ML: 2x
  Handel A: 1x
  Huang TJ: 1x
  Adami C: 1x
  Jackson EL: 5x
  Yellman CM: 1x
  McWhite CD: 1x
  Woodman A: 1x
  Jiang Q: 1x
  Bull JJ: 2x
  Person MD: 1x
  Raman R: 1x
  Ellington AD: 1x
  Tucker AT: 1x
  Sridhara V: 3x
  Spielman SJ: 10x
  Riley S: 1x
  Maurer-Stroh S: 1x
  Vander Wood D: 1x
  Shahmoradi A: 2x
  Liu W: 1x
  Yao J: 1x
  Bedford T: 2x
  M
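
The tallying loop in the solution uses a plain dictionary. As an alternative sketch, `collections.Counter` from the standard library can do the same bookkeeping in fewer lines. The helper name `count_coauthors` and the sample records are hypothetical; in the homework, `records` would be the output of `Medline.parse()`, whose records behave like dictionaries.

```python
from collections import Counter


def count_coauthors(records, self_name="Wilke CO"):
    """Count how often each co-author appears across dictionary-like records."""
    counts = Counter()
    for record in records:
        # .get() returns an empty list when a record has no "AU" field
        authors = record.get("AU", [])
        # tally every author except the person whose papers we searched for
        counts.update(a for a in authors if a != self_name)
    return counts


# Made-up records for illustration only
sample = [
    {"AU": ["Author A", "Wilke CO"]},
    {"AU": ["Author A", "Author B", "Wilke CO"]},
    {"TI": "A record without an author list"},
]

# most_common() lists the most frequent co-authors first
for name, n in count_coauthors(sample).most_common():
    print(" ", name + ":", str(n) + "x")
```

Because a `Counter` stores each name as a key exactly once, no co-author is printed twice, and `len(count_coauthors(sample))` gives the number of distinct co-authors.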

**Problem 3 (4 points):** From 2015-2019 (inclusive), how many of Dr. Wilke's papers contain the terms "evolution" or "evolutionary" in the abstract? Use python and **regular expressions** to find an answer.

**Hint:** In a regular expression, you can match the same word with slightly different endings using the "`|`" (or) operator. For example, the regex "bacteri(a|um)" would match both "bacteria" and "bacterium".
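
To make the hint concrete, here is a small sketch of how alternation behaves with `re.search()`. One point worth noting: `re.search()` looks for the pattern anywhere in the string, so "bacteri(a|um)" also matches inside a longer word such as "bacterial". Adding `\b` word boundaries is one way to restrict the match to whole words. The helper name `matches` is made up for this example.

```python
import re


def matches(pattern, text):
    """Return True if the pattern is found anywhere in the text."""
    return re.search(pattern, text) is not None


# Alternation: either ending is accepted
print(matches(r"bacteri(a|um)", "bacteria"))    # True
print(matches(r"bacteri(a|um)", "bacterium"))   # True

# Without word boundaries, the pattern also matches inside longer words
print(matches(r"bacteri(a|um)", "bacterial"))       # True
print(matches(r"\bbacteri(a|um)\b", "bacterial"))   # False

# An optional ending can be written with "?" instead of an empty alternative
print(matches(r"\bevolution(ary)?\b", "evolutionary theory"))  # True
print(matches(r"\bevolution(ary)?\b", "coevolution"))          # False
```

Whether substring matches such as "coevolution" should count is a judgment call for the question being asked; the word-boundary version is simply the stricter reading.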

In [3]:
# You'll need the module re for regular expressions
import re

handle = Entrez.efetch(db="pubmed", id=pmid_list, rettype="medline", retmode="text")
records = Medline.parse(handle)

# Start a counter
ab_count = 0

for record in records:
    
    # check if a record has an abstract
    if "AB" in record:
        
        # Check for the term "evolution" or "evolutionary" in the abstract 
        match = re.search(r"evolution(|ary)", record["AB"].lower())
        
        # if "evolution" or "evolutionary" is in the abstract, increment the count by 1 
        if match: 
            ab_count += 1

# Close the efetch handle    
handle.close()

print(ab_count, 'of the abstracts contain "evolution" or "evolutionary"')


27 of the abstracts contain "evolution" or "evolutionary"
