### Rev. 4—050523

# It's time for the world-famous ArnaoutLab interview!

You have **4 days** to complete the questions below.

**Each question consists of several parts.** If you have trouble with one part, just explain what you're doing, what's got you stumped, and what you would do if you could get past the part that's giving you trouble, and move on to the next part.

**These are meant to be challenging!** So don't worry if it feels like a lot. However, we think **you'll also find them rewarding.**

**If you have any questions, or feel you might need more time,** please text Dr. Arnaout at 617-538-5681. We will not be checking email frequently enough to answer questions by email.

**You're free to use any resource** (library documentation, online search, Stack Overflow, etc.) as needed to help you solve these problems. That's what we do in the lab. However, we prefer that you work by yourself, because we won't necessarily be able to hire the people you worked with. That said, if you had to ask someone for help, just say so: honesty is the best policy.

**Show your work in this Python notebook.** We should be able to run the notebook and see your work in action.

**Please comment your code!** It helps us see how you think.

# Good luck, and have fun!

---

# Question 1

This question is about patterns in protein sequence.

## Part 1: Display

**Retrieve the <a href="https://www.rcsb.org/structure/1igt">Protein Data Bank entry `1IGT`</a>:**

![image.png](attachment:image.png)

**and display it interactively in this Jupyter notebook.**

- You can retrieve it however you like, but we recommend doing it programmatically, for example using Biopython.
- You can display it using whatever viewer you like, but we recommend Nglview; worst-case, feel free to use matplotlib's scatter in 3D and plot the atoms.
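
A minimal sketch of one way to approach this, assuming Biopython and nglview are installed in the notebook environment. The function names (`show_entry`, `atom_coordinates`, `scatter_atoms`) are illustrative and not part of either library. The fallback parser reads the fixed-column coordinate fields of PDB-format `ATOM`/`HETATM` records, so it only applies to the legacy PDB file format.

```python
def show_entry(pdb_id="1IGT", directory="."):
    """Download a PDB entry with Biopython and return an nglview widget."""
    # Imported lazily so the fallback below still works without these packages.
    from Bio.PDB import PDBList, PDBParser
    import nglview

    path = PDBList().retrieve_pdb_file(pdb_id, pdir=directory, file_format="pdb")
    structure = PDBParser(QUIET=True).get_structure(pdb_id, path)
    # Make the returned widget the last value of a cell to display it.
    return nglview.show_biopython(structure)


def atom_coordinates(pdb_text):
    """Return (x, y, z) tuples for every ATOM/HETATM record in PDB-format text."""
    coords = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            # PDB format: x, y, z occupy columns 31-38, 39-46, 47-54 (1-indexed).
            coords.append((float(line[30:38]), float(line[38:46]), float(line[46:54])))
    return coords


def scatter_atoms(coords):
    """Fallback viewer: 3D matplotlib scatter of atom positions."""
    import matplotlib.pyplot as plt

    xs, ys, zs = zip(*coords)
    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    ax.scatter(xs, ys, zs, s=1)
    return fig
```

In a notebook cell, `show_entry()` would be the interactive route; `scatter_atoms(atom_coordinates(open(path).read()))` is the worst-case route, where `path` is wherever the downloaded file was saved.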

## Part 2: Patterns

The `Question_1_data` directory contains two text files, `r1.txt` and `r2.txt`. Each contains a *repertoire*: a list of 10,000 variable-length strings. Each string represents a short amino acid sequence that looks something like the following:

    CGRAMCSMYEPFPSSVLEMKITFDYW
    CGRAWMKYSHYVYCYMRFDCKFCYSMMDVW
    CGRLMLFWRQVRDHIRSMIWMVYKGFDYW

Your goal is to describe differences between these two repertoires.

1. Plot histograms of the string-length distributions of each repertoire.
2. Comment on how the length distribution of repertoire 1 compares to the length distribution of repertoire 2.
3. Provide simple summary statistic(s) for each distribution, and in a sentence explain why these statistics provide a reasonable summary.
4. Test the null hypothesis that the two length distributions differ no more than would be expected by chance:
    1. Choose a statistical test for this hypothesis
    2. Run that test and provide the p value
    3. Provide a 1-sentence summary of why you chose this test, and how you interpret the result.
5. What is the:
    1. Shannon entropy of each distribution?
    2. Rényi entropy of order 1 of each distribution?
    3. Inverse Simpson's index of each distribution?
    4. $e$ raised to the Rényi entropy of order 2 of each distribution?
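
For reference, the diversity measures in item 5 can be sketched as follows. This assumes "distribution" means the empirical distribution of string lengths and that entropies use the natural logarithm (nats); both are assumptions, as are the function names. The Rényi entropy of order $\alpha \neq 1$ is $H_\alpha = \frac{1}{1-\alpha}\ln\sum_i p_i^\alpha$, and order 1 is defined by its limit.

```python
import math
from collections import Counter


def probabilities(values):
    """Empirical probability of each distinct value (e.g. each string length)."""
    counts = Counter(values)
    total = sum(counts.values())
    return [c / total for c in counts.values()]


def shannon_entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def renyi_entropy(p, order):
    """Renyi entropy in nats; order 1 is taken as its limit, the Shannon entropy."""
    if math.isclose(order, 1.0):
        return shannon_entropy(p)
    return math.log(sum(pi ** order for pi in p if pi > 0)) / (1.0 - order)


def inverse_simpson(p):
    """Inverse Simpson index: 1 / sum of squared probabilities."""
    return 1.0 / sum(pi ** 2 for pi in p)


# Usage with the three example sequences shown earlier in this part.
example = ["CGRAMCSMYEPFPSSVLEMKITFDYW",
           "CGRAWMKYSHYVYCYMRFDCKFCYSMMDVW",
           "CGRLMLFWRQVRDHIRSMIWMVYKGFDYW"]
p = probabilities(len(s) for s in example)
print(shannon_entropy(p), renyi_entropy(p, 2), inverse_simpson(p))
```

Comparing the four quantities for the same distribution is a useful sanity check on any implementation.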

## Part 3: Significance

For the bonus, let's try some subtler patterns. We want to know whether any 3-letter substrings (i.e., 3-mer amino-acid motifs) are statistically over-represented in one repertoire relative to the other, more than expected by chance.

1. Clearly state the null hypothesis you are testing.
2. Describe, in words, your approach to testing this hypothesis.
    1. Specifically, describe how you will ensure that any differences you identify would not be expected by chance due to random sampling.
3. Identify any 3-mer motifs that are over-represented in one or the other repertoire. If you find any, list it/them, and indicate which repertoire it/they are over-represented in.
4. For each such motif, what is the probability of seeing the observed level of over-representation by chance? Show which statistical test you used.
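
One possible approach, sketched under stated assumptions: count overlapping 3-mers, then run a one-sided label-permutation test for a single motif, treating the repertoire labels as exchangeable under the null. This is only one of several reasonable tests, and the function names and default parameters are illustrative.

```python
import random
from collections import Counter


def kmer_counts(sequences, k=3):
    """Count every overlapping k-mer across a list of sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq[i:i + k] for i in range(len(seq) - k + 1))
    return counts


def _hits_and_windows(seq, motif):
    """(occurrences of motif, number of k-mer windows) for one sequence."""
    k = len(motif)
    windows = max(len(seq) - k + 1, 0)
    hits = sum(seq[i:i + k] == motif for i in range(windows))
    return hits, windows


def _frequency(per_seq):
    """Fraction of all k-mer windows in a group that match the motif."""
    hits = sum(h for h, _ in per_seq)
    windows = sum(w for _, w in per_seq)
    return hits / windows if windows else 0.0


def permutation_p_value(rep_a, rep_b, motif, n_permutations=1000, seed=0):
    """One-sided p value that motif is over-represented in rep_a versus rep_b."""
    rng = random.Random(seed)
    per_seq = [_hits_and_windows(s, motif) for s in list(rep_a) + list(rep_b)]
    n_a = len(rep_a)
    observed = _frequency(per_seq[:n_a]) - _frequency(per_seq[n_a:])
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(per_seq)  # reassign sequences to repertoires at random
        if _frequency(per_seq[:n_a]) - _frequency(per_seq[n_a:]) >= observed:
            extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)
```

With 20 standard amino acids there are 8,000 possible 3-mers, so p values from a scan over all motifs would need a multiple-testing correction (for example Bonferroni or a false-discovery-rate procedure) before any motif is called significant.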

---

# Question 2

This question is about images and information.

## Part 1

In the `Question_2_data` directory you will find PNGs of 3,671 32x32-pixel emojis in the folder called `Emoji/` (courtesy of https://github.com/joypixels/emoji-toolkit):

![image-7.png](attachment:image-7.png)

and a set of text annotations for each one in the .json file `emoji.json`, specifically in the `name`, `category`, and `keywords` fields.

Hint: you can load the .json file as follows:

    import json
    with open("emoji.json") as f:
        annotation_dictionary = json.load(f)

The task in this section is to calculate and illustrate the similarities between/among emojis in various ways, specifically by:

1. similarity of color profile
2. similarity of label semantic meaning (using the `emoji.json` file)
3. similarity according to a deep-net embedding (feel free to use an off-the-shelf and/or pre-trained model)

We leave it to you to interpret these how you will (we want to see what you come up with!).
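
As one illustration of the first item, color-profile similarity could be sketched as the cosine similarity between coarse RGB histograms. This is only one interpretation among many. It assumes each emoji has already been read into a sequence of `(r, g, b, a)` tuples with 8-bit channels (for example by opening the PNG with Pillow and converting it to RGBA); the function names and the choice of 4 bins per channel are illustrative.

```python
import math


def color_histogram(pixels, bins_per_channel=4):
    """Normalized coarse RGB histogram, skipping fully transparent pixels."""
    n = bins_per_channel
    hist = [0.0] * n ** 3
    for r, g, b, a in pixels:
        if a == 0:
            continue  # transparent background carries no color information
        # Map each 0-255 channel value to a bin index in 0..n-1.
        index = ((r * n // 256) * n + g * n // 256) * n + b * n // 256
        hist[index] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else hist


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


# Usage: two tiny synthetic "images".
red = [(255, 0, 0, 255)] * 10
blue = [(0, 0, 255, 255)] * 10
print(cosine_similarity(color_histogram(red), color_histogram(blue)))
```

Coarse bins make the comparison tolerant of small shade differences; finer bins or a perceptual color space would be reasonable alternatives.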

In each case:

1. describe your approach in words (in markdown blocks and/or in code comments)
2. plot a network representation, with each emoji as a node (use a similarity cutoff of your choice, or lay out the network with edge weights mapped to the alpha channel; the `networkx` python package may be useful). For example, here is what some network fragments might look like:

<div>
<img width=375px src="attachment:image-4.png">
</div>
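
A minimal sketch of the cutoff approach, assuming a precomputed square similarity matrix (a list of lists, or anything indexable as `similarity[i][j]`) whose rows follow the order of `names`. The helper names are hypothetical; `build_graph` assumes `networkx` is installed.

```python
from itertools import combinations


def similarity_edges(names, similarity, cutoff):
    """Return (name_a, name_b, weight) for every pair with similarity >= cutoff."""
    edges = []
    for i, j in combinations(range(len(names)), 2):
        weight = similarity[i][j]
        if weight >= cutoff:
            edges.append((names[i], names[j], weight))
    return edges


def build_graph(names, edges):
    """Build an undirected weighted graph from the thresholded edges."""
    import networkx as nx  # imported lazily; only needed for the graph itself

    graph = nx.Graph()
    graph.add_nodes_from(names)           # keep isolated emojis as nodes
    graph.add_weighted_edges_from(edges)  # weight is stored in the "weight" attribute
    return graph
```

The result could then be drawn with, for example, `nx.draw(graph, pos=nx.spring_layout(graph, seed=0), node_size=20)`. With 3,671 emojis there are about 6.7 million pairs, so a fairly strict cutoff helps keep the plot readable.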

## Part 2: Interpretation

This part is about interpretation.

1. Please identify and plot pairs of emojis that differ the most between these three methods (e.g. "emojis x and y have similar colors, but the deep-net hidden layer thinks they are nothing alike").
2. Describe a couple of these differences in a sentence or two.
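
One way to find such pairs, sketched under the assumption that two methods have each produced a square similarity matrix over the same emoji order and on a comparable scale (for example both in [0, 1], or both rank-transformed). The function name is hypothetical.

```python
from itertools import combinations


def most_discordant_pairs(names, sim_a, sim_b, top=5):
    """Rank emoji pairs by how strongly two similarity matrices disagree."""
    scored = []
    for i, j in combinations(range(len(names)), 2):
        gap = sim_a[i][j] - sim_b[i][j]  # positive: method A finds the pair more similar
        scored.append((abs(gap), names[i], names[j], gap))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(a, b, gap) for _, a, b, gap in scored[:top]]


# Usage with toy matrices for three emojis.
names = ["x", "y", "z"]
color = [[1.0, 0.9, 0.2], [0.9, 1.0, 0.3], [0.2, 0.3, 1.0]]
deep = [[1.0, 0.1, 0.2], [0.1, 1.0, 0.4], [0.2, 0.4, 1.0]]
print(most_discordant_pairs(names, color, deep, top=1))
```

The sign of the returned gap indicates which method considers the pair more similar, which is a natural starting point for the written description.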