# Introduction to Statistics & Data Analysis

Welcome to the second programming module! This module will cover some basic statistics and data analysis techniques, with some examples in biological datasets. Along the way, we'll be introducing some aspects of the following packages:

1. **Numpy**: This package is the basis for most scientific computing packages in Python. It allows you to quickly and efficiently compute functions of *arrays*, which are collections of values. 
2. **Matplotlib**: This package is built on top of Numpy, and is used in plotting your data.
3. **Pandas**: This package provides tools for working with *tabular* data, or data organized in tables.
4. **Scipy**: This set of packages includes tools for a variety of scientific computing purposes. In particular, it includes tools for working with statistics. 

At the end of the module, we'll provide some resources for further reading that may be useful to you in your projects.

# Descriptive Statistics

In this section of the module, we'll be exploring some basic tools for doing **descriptive statistics** in Python. Make sure to run every chunk unless otherwise stated.

## Python Packages and Numpy

Python's **packages** are collections of code written by other programmers, which are used for specific tasks. Each package can contain special functions and objects needed for a task. For example, as we mentioned in the introduction, the **Numpy** package contains functions for manipulating arrays of data. 

In order to use a python package, we first need an `import` statement. This allows us to load in the functions present in the package and use them. We can do this as follows:

In [None]:
import numpy as np

Note that we `import numpy` as you might expect, but we also have an additional part of the statement: `as np`. This is not required, but is often helpful so that we don't have to type out `numpy` in full every time we need to call a function from the numpy package. You will often see people use the `np` abbreviation for numpy, and there are similar conventions for many other packages. 

There is another form of the import statement, which we **don't recommend using**:

In [None]:
# Don't run this
from math import *

This imports *all* public names from the `math` module (part of Python's standard library, providing basic mathematical functions). With this form, we don't have to type out the word `math` every time we want a function from it; we can just call it by name directly. The reason this is not recommended is that it can easily introduce **name clashes**, where something in the package has the same name as a function or object you've defined, and whichever was defined last silently replaces the other. If you need a shorter version of the package name, it's recommended to use the `import ... as` version in the previous chunk.
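
To make the name-clash point concrete, here is a minimal sketch. The `sqrt` helper below is a hypothetical function of our own, invented only for this illustration; because `math` is imported under the alias `m`, both versions stay available side by side.

```python
import math as m

def sqrt(value):
    """Hypothetical helper of our own that happens to share a name with math.sqrt."""
    return "custom sqrt of {0}".format(value)

ours = sqrt(16)      # calls the function we defined above
theirs = m.sqrt(16)  # calls the math module's function through the alias

print(ours)
print(theirs)
```

Had we written `from math import *` after defining our own `sqrt`, the name `sqrt` would have pointed at the `math` version instead, with no warning.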

Now that we've imported the `numpy` package, let's use it for some basic calculations. To begin, let's create an *array* of values in numpy:

In [None]:
x = np.array([0,1,2,3,4,5,6,7,8])

The `np.array()` function (remember that `np` is just an abbreviation for `numpy`) is a function within numpy that creates an array when given a list of values as input. An array is just a collection of values like a list; however, it is often much faster to use and perform mathematical operations on than a basic python list. This is because the code that is used to create and operate on these arrays can operate on all of the elements of the array at the same time, and uses code written in C or Fortran (programming languages that are *compiled* into executable programs and can have much better performance than python for tasks like this).

Let's check the type of x:

In [None]:
type(x)

Note that the type is `numpy.ndarray`. This type is the array type defined within numpy. The "nd" in the name means N-Dimensional, referring to the fact that numpy arrays can have any number of dimensions. For example, we can "reshape" this array that we've created into a 3x3 matrix as follows:

In [None]:
x = x.reshape(3,3)
x

Note that we've used a *method* here to do this. The `numpy.ndarray` type defines many useful methods for operating on `ndarray` objects, which can be found in the Numpy documentation (which can be accessed in the Help tab of the Jupyter interface). This method in particular, `reshape(rows, cols)`, arranges the values of the given array into `rows` rows and `cols` columns. The total number of elements must stay the same (here, 3 x 3 = 9).
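
As a small sketch of how reshaping behaves, the example below uses its own illustrative arrays (`values`, `grid` and `auto` are names chosen here, not part of the module's data). It also shows that passing `-1` for one dimension asks numpy to work out that dimension for us.

```python
import numpy as np

values = np.arange(12)        # 12 values: 0, 1, ..., 11
grid = values.reshape(3, 4)   # 3 rows, 4 columns
auto = values.reshape(2, -1)  # -1 lets numpy infer the number of columns

print(grid.shape, grid.ndim)  # shape and number of dimensions
print(auto.shape)
```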

Let's now demonstrate some basic properties of numpy arrays. The first of these is the fact that we can operate on all of the elements of the array simultaneously. Operations like this are called **vectorized**. For example, in "pure" Python, if we wanted to square all of the elements of the list, we might do so with a `for` loop, as follows:

In [None]:
x1 = [1,2,3,4,5]
for i in range(len(x1)): # Loop over all indices in x1
    x1[i] = x1[i]**2
    
x1

However, with a numpy array, we can simply do the following:

In [None]:
x2 = np.array(x1) # Create a numpy array out of x1
x2 = np.power(x2, 2) # Raises every element in x2 to the 2nd power
x2

We're not just limited to operations like this however; vectorization extends even to arithmetic operations between arrays. For example, we can do the following:

In [None]:
# Multiply an array by a number:
2*x2

In [None]:
# Multiply two arrays of the same dimension:
x3 = np.array(x1)
x2*x3
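
The same vectorized behaviour applies to Python's arithmetic operators. The sketch below uses an illustrative array called `base` (a name chosen for this example) to show that `**` gives the same result as `np.power`, and that addition and division also work element by element.

```python
import numpy as np

base = np.array([1, 2, 3, 4, 5])

squared_operator = base ** 2          # elementwise power using the operator
squared_function = np.power(base, 2)  # the same calculation using the numpy function
same = np.array_equal(squared_operator, squared_function)

shifted = base + 10                   # adds 10 to every element
ratio = squared_operator / base       # elementwise division of two arrays

print(same)
print(shifted)
print(ratio)
```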

Lastly, there are some functions we can use to create specific types of arrays, which are often useful to us:

In [None]:
# Create a 3x3 array of just zeros
zeros = np.zeros((3,3)) # Notice the second set of parentheses
print(zeros)

# Create a one-dimensional array of four ones (its shape is (4,), not 4x1)
ones = np.ones((4,))
print(ones)

# Create a sequential array of values
ranged = np.arange(0, # Start (included)
                   10, # Stop (excluded)
                   1) # Step between values
print(ranged)

# Another type of ranged array
# In this case, we are creating a set of evenly-spaced points
spaced = np.linspace(-1,1,5) # 5 evenly-spaced points on the interval [-1,1]
print(spaced)

# Create an array of random entries
# Drawing from a *uniform* distribution, where each value is equally likely:
rand_unif = np.random.rand(1,5)
print(rand_unif)
# Drawing from a *normal* distribution (bell curve) with mean 0 and standard dev. 1:
rand_norm = np.random.randn(1,5)
print(rand_norm)

There are many more useful functions available to use for numpy arrays. If you will be encountering them a lot in your project, we highly recommend going through the Numpy documentation and tutorials to learn more. In the next section, we'll use numpy to explore some basic statistical calculations.
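
One example worth knowing about is numpy's generator interface for random numbers, `np.random.default_rng`, which the Numpy documentation recommends for new code. The sketch below is an alternative to the `np.random.rand` and `np.random.randn` calls above, not a replacement for them in this module; the seed value 42 is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # a seeded generator gives reproducible draws

unif = rng.random((1, 5))             # uniform values in [0, 1)
norm = rng.standard_normal((1, 5))    # normal values with mean 0 and standard dev. 1

# A second generator with the same seed reproduces the same uniform draws
rng_again = np.random.default_rng(seed=42)
unif_again = rng_again.random((1, 5))

print(unif)
print(norm)
print(np.array_equal(unif, unif_again))
```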

## Descriptive Statistics with Numpy

Let's first recall some basic descriptive statistics. Measures of **central tendency** are used to find values that can be used as a "center" for the data. They summarize the entire sample of data with a single number. These include:

1. The **mean** (often symbolized $\mu$ or $\bar{x}$), which is just the average value of the data (add up all of the data points and divide by the number of data points).
2. The **median** of the data, which is the "middle value" of the data. This can be found by ordering all of the data values from highest to lowest, and selecting the center value. If there are an even number of data points, the median is the average of the middle two values.
3. The **mode** of the data, which is the most common value in the dataset.

We can calculate the mean and median for a numpy array with the following numpy functions:

In [None]:
# Make sure you've run the import chunk in the previous section
x1 = np.random.randn(1,100)
mean = np.mean(x1)
med = np.median(x1)

print("Mean: {0}, \nMedian: {1}".format(mean,med)) 
# Here we are using the string format method to substitute our values
# Check the python documentation for strings to learn more
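
Numpy has no dedicated function for the mode, so here is one possible sketch using `np.unique`, which can count how often each distinct value occurs. The array `data` is made up for this illustration. If several values tie for most common, this approach reports the smallest of them.

```python
import numpy as np

data = np.array([2, 3, 3, 5, 7, 3, 5])  # made-up values for illustration

# np.unique returns the sorted distinct values and, on request, their counts
values, counts = np.unique(data, return_counts=True)
mode = values[np.argmax(counts)]         # the value with the highest count

print("Mean: {0}, Median: {1}, Mode: {2}".format(np.mean(data), np.median(data), mode))
```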

Next, we can calculate some measures of **variability**, which measure how spread-out the values of a dataset are. Recall that these include the following:

1. Standard Deviation: This is a common measure of the spread of the data. It is calculated by taking the difference between each data point and the mean, squaring them and adding the squared differences, dividing this sum by the number of data points n (for an entire population) or n-1 (for a sample), and then taking the square root of the result. In equation form, this is:
$$\sigma = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2}$$
for a sample, where $\sigma$ is a symbol commonly used for the standard deviation, and the $\sum$ symbol refers to taking the sum of the given expression for all values of $i$ up to $n$ (each $x_i$ is one of the data points). The quantity $\sigma^2$ is often called the **variance**.
2. Interquartile Range: This is the difference between the 25th and 75th percentile of the data (i.e. a range in which the middle 50% of the data lies).

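We can calculate the sample standard deviation quickly on a numpy array as follows. Note that `np.std` divides by n (the population formula) by default; passing `ddof=1` makes it divide by n-1, matching the sample formula above:

In [None]:
std = np.std(x1, ddof=1) # ddof=1 gives the sample standard deviation
print(std)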

We can see that our array has a standard deviation of about 1, corresponding to the method we used to generate it (drawing from a normal distribution with standard deviation 1).
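
The two measures of variability described above can be sketched on a small made-up array (`data` is an illustrative name, not one of the module's datasets). The `ddof` argument of `np.std` switches between dividing by n (`ddof=0`, the default) and by n-1 (`ddof=1`), and `np.percentile` gives the percentiles needed for the interquartile range.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])  # made-up values

pop_std = np.std(data)             # divides by n (population formula)
sample_std = np.std(data, ddof=1)  # divides by n-1 (sample formula)

# 25th and 75th percentiles, then their difference
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

print("Population std: {0}, sample std: {1}".format(pop_std, sample_std))
print("IQR: {0}".format(iqr))
```

The sample value is always a little larger than the population value for the same data, and the gap shrinks as the number of data points grows.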

Of course, most data you encounter will not be in the form of a numpy array you've created, but as data stored in tables of values. The next section will go over some ways of dealing with such data using the **pandas** package.

## Pandas and Loading in Data

As mentioned previously, **pandas** is a library used for manipulating data in the form of tables. You will often encounter data stored this way. For example, one common method of data storage is in the form of **CSV**, or **comma-separated values** files, where each column of the data is separated with a comma. We've included some such files with this module, so we can test out reading and manipulating this data.

We'll start with a file called `ChickWeight.csv`, which is a file (included with this module) that gives the weights of various chickens over time when they were placed on different diets. To load in this file, we first need to import the `pandas` package (often abbreviated `pd`):

In [None]:
import pandas as pd

From here, we can use the `read_csv` function within pandas, which reads CSV files and converts them to pandas **DataFrames** (pandas's representation of a table of data):

In [None]:
chick_weight = pd.read_csv('ChickWeight.csv')
chick_weight

From this printed data (which Jupyter formats nicely for us), we can observe a few things:

- There are a total of 578 rows and 5 columns to the data.
- The columns are, in order: `Unnamed` (corresponding to the index), `weight` (corresponding to the weight measurement of a chicken), `Time` (corresponding to the time a weight measurement was taken starting at time 0), `Chick` (corresponding to the identity of the particular chicken being measured), and `Diet` (corresponding to the number for the diet the chicken was on).
- There seem to be a total of 50 different chickens and 4 different diets.

We can find some basic information about the dataset as follows:

In [None]:
# Type of the data:
type(chick_weight)

In [None]:
# The "shape" of the data, which is the dimension of the dataframe
chick_weight.shape

In [None]:
# The data's columns:
chick_weight.columns

In [None]:
# We can view the first few values of the data using the "head" method
# This is useful when you've loaded a large dataset and just want to know what it "looks like"
chick_weight.head()

In [None]:
# We can also view a quick overview of summary statistics for a dataframe with the `describe` method
# As you will see, this is not always useful for every column
chick_weight.describe()

We can select values from the dataframe using the column names. For example, suppose that we wanted to select all values in the `weight` column, and find their mean. We could do this as follows:

In [None]:
chick_weight['weight'].mean()

In reality, we will often want to select data in more complicated ways. For example, we may want to filter out the data to only keep some that matches a certain criterion, and then perform some calculations on that data. Here, for example, we might want to know the mean starting weight for the chickens (i.e. their mean weight at time 0). To do this, we can do the following:

In [None]:
chick_weight[chick_weight['Time'] == 0]['weight'].mean()

Let's break down how this syntax works. We can start with the first piece, `chick_weight[chick_weight['Time'] == 0]`. If we examine the expression within the brackets, we get:

In [None]:
chick_weight['Time'] == 0

As we can see, this is an array of **boolean** values. For each of the values in the column, it has a value of `True` if the value is equal to 0 (`==0`), and `False` otherwise. In other words, we've performed the comparison in a *vectorized* manner across all entries in the column. The selection code then selects only the values in the dataframe where the value is `True`:

In [None]:
chick_weight[chick_weight['Time'] == 0].head()
# Using the head method to avoid printing too many values

We can see that the values of the index selected correspond to the `True` values in the previous chunk. Since we've returned an entire dataframe here, and we're only interested in the `weight` column, we can extract the weights the usual way (using `['weight']`), and then use the `mean()` method to find the mean.
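
Here is the same filtering idea on a tiny made-up dataframe, so that each step can be checked by eye. The `toy` dataframe and its values are invented for this sketch. It also shows `.loc`, which selects the rows and the column in a single step, and how two conditions can be combined with `&`.

```python
import pandas as pd

# Made-up data with the same column names as ChickWeight
toy = pd.DataFrame({
    "weight": [42, 51, 40, 49, 41, 58],
    "Time":   [0, 2, 0, 2, 0, 2],
    "Chick":  [1, 1, 2, 2, 3, 3],
})

mask = toy["Time"] == 0                      # boolean value for every row
start_mean = toy.loc[mask, "weight"].mean()  # rows where mask is True, weight column only

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
heavy_late = toy[(toy["Time"] == 2) & (toy["weight"] > 50)]

print(mask.tolist())
print(start_mean)
print(heavy_late)
```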

We can also *group* by a certain criterion when performing calculations. For example, we could split this dataframe by the individual chicken, and calculate the mean weights of each chicken throughout the entire time period. We can do this using the `groupby` method:

In [None]:
groupby_chicken = chick_weight.groupby('Chick')

# If we now use the mean method, it will automatically respect our groupings
groupby_chicken.mean()
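
Often we only want a grouped summary of one column. A sketch of that, using a small made-up dataframe (`toy`, invented for this example), selects the column after grouping and can compute several statistics at once with `agg`:

```python
import pandas as pd

# Made-up weights for two diets
toy = pd.DataFrame({
    "weight": [42, 51, 40, 49, 41, 58, 39, 60],
    "Diet":   [1, 1, 1, 1, 2, 2, 2, 2],
})

# Mean weight within each diet
mean_by_diet = toy.groupby("Diet")["weight"].mean()

# Several summary statistics per diet in one table
summary = toy.groupby("Diet")["weight"].agg(["mean", "max", "count"])

print(mean_by_diet)
print(summary)
```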

Pandas dataframes also have the ability to produce some basic plots. As an example, we can plot boxplots for each diet's weight as follows:

In [None]:
groupby_diet = chick_weight.groupby('Diet')
box = groupby_diet.boxplot(column="weight", figsize=(10,10))

Pandas plots use the package `matplotlib` under the hood. Matplotlib is an extremely useful package for generating plots of all kinds, and we highly recommend you learn more about it when creating plots for your project.

## Hypothesis Tests

A common task in data analysis is to conduct a **hypothesis test**. Recall that this is a statistical procedure where we test whether the data support a **hypothesis** (a testable prediction about our results). For this section, we will be looking at a different dataset:

In [None]:
tooth_growth = pd.read_csv("ToothGrowth.csv", sep=",")
# Note the optional "sep" argument
# This is useful if the values are separated by a character other than a comma
tooth_growth.head(15)

This dataframe describes the effect of vitamin C on the tooth growth of 60 guinea pigs. The `len` column measures the length of the tooth, the `supp` column describes whether vitamin C was delivered as orange juice (labeled `OJ`) or ascorbic acid (`VC`), and the `dose` column gives the daily dose of vitamin C in milligrams.

Given this data, we might want to test, for example, whether there was an overall difference in tooth length between the VC and OJ groups. To do this, we can use a **two-sample t-test**. This is a kind of statistical test that tests whether the difference between the means of two different groups is significant. The function for conducting a 2-sample t-test is located in the **scipy.stats** package:

In [None]:
from scipy import stats

This import statement imports the `scipy.stats` submodule *from* the `scipy` package, which is a collection of submodules for various scientific tasks.

From here, we can use the `stats.ttest_ind()` function for our test:

In [None]:
# We can describe the data before conducting the test as follows:
tooth_growth.groupby('supp').describe()['len']

In [None]:
# First, we can create arrays to store the data we want
VC_len = tooth_growth[tooth_growth['supp'] == 'VC']['len']
OJ_len = tooth_growth[tooth_growth['supp'] == 'OJ']['len']

# Then, we can run the test
stats.ttest_ind(VC_len, OJ_len)

In general, hypothesis tests work by the following procedure:

1. Choose a **test statistic**, which is a measurement calculated from the data. This test statistic is assumed to follow a certain **probability distribution** (which describes the probability of a given value of the test statistic) under the null hypothesis.
2. Calculate the test statistic for our data.
3. Find the **p-value**, which is the probability, assuming the null hypothesis is true, of getting a test statistic *at least as extreme* as the one we observe.
4. Based on this p-value, we can decide whether or not to reject the null hypothesis.

In our example, the test statistic is a 2-sample *t-statistic*, which is used when testing for a difference in means between two samples. Since we are testing for a difference between the means of the two conditions, our **null hypothesis** would be that there is no difference (i.e. a difference of 0), and our **alternate hypothesis** is that there is a nonzero difference (review the slides from this week's lecture for an overview of these definitions).

The t-statistic is said to follow the [**Student's t-distribution**](https://en.wikipedia.org/wiki/Student%27s_t-distribution), which is a probability distribution that looks somewhat like a normal distribution (a bell curve) but with heavier tails. The t-distribution has an additional parameter called the *degrees of freedom*, and when this value is very large the t-distribution is approximately equal to the normal distribution. To use this test statistic, we have to make a few assumptions. First, we assume that the observations are independent and that the measured value is approximately normally distributed within each group. This matters most when the sample size is small (roughly <30 per group), since for larger samples the distribution of the sample mean is close to normal anyway. In addition, the standard two-sample t-test (the default for `stats.ttest_ind`) assumes that the population standard deviations of the two groups are unknown but equal. While we cannot always assume this, in practice if the sample standard deviations are close together we can treat the population standard deviations as similar, as long as we note this assumption. In our case, we can see that the two sample standard deviations are somewhat different, so we might want to examine this assumption; passing `equal_var=False` to `stats.ttest_ind` runs Welch's t-test, which does not require equal standard deviations. For more information, see the [Wikipedia article on Student's t-tests](https://en.wikipedia.org/wiki/Student's_t-test).

Once we have calculated the test statistic, we can calculate the **p-value**, which is the probability of obtaining a test statistic at least as extreme as the one we calculated, assuming the null hypothesis is true. Essentially, we are asking how likely results like ours would be to occur by random chance alone if our null hypothesis (that there is no difference between the groups) were correct. Something important to note here is that we did not test whether one group specifically had a higher or lower length on average. This means we are conducting what is called a **two-sided test**: when calculating the p-value, we calculate the probability that the difference in means, in either direction, is at least as large in magnitude as the one we see. A **one-sided** test might instead calculate the probability, under the null model, that the OJ group's mean exceeds the VC group's mean by at least as much as is found here.

From our test results, we are given a test statistic and a p-value. We can see that the p-value is roughly 0.06. Remember from lecture that a standard cutoff for the p-value indicating a significant result is <0.05. Since this p-value does not meet this criterion, we *fail to reject the null hypothesis* and conclude that there is insufficient evidence to suggest a difference between the two groups. Of course, there are factors we did not take into account, such as dosage, so a researcher may conduct more sophisticated testing to account for this.
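
The decision step can be sketched in code. To keep the example self-contained, it uses randomly generated groups rather than the ToothGrowth data; the group means, spreads, sizes and seed are all arbitrary choices for illustration. It runs both the default (equal-variance) test and Welch's version (`equal_var=False`), and compares each p-value to the 0.05 cutoff.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed so the draws are reproducible
group_a = rng.normal(loc=17.0, scale=8.0, size=30)  # made-up measurements
group_b = rng.normal(loc=20.0, scale=6.0, size=30)

pooled = stats.ttest_ind(group_a, group_b)                  # assumes equal variances
welch = stats.ttest_ind(group_a, group_b, equal_var=False)  # Welch's t-test

alpha = 0.05  # significance cutoff
for name, result in [("pooled", pooled), ("Welch", welch)]:
    decision = "reject" if result.pvalue < alpha else "fail to reject"
    print("{0}: t = {1:.3f}, p = {2:.3f} -> {3} the null hypothesis".format(
        name, result.statistic, result.pvalue, decision))
```

When the two groups have the same size, as here, both versions give the same t-statistic and differ only in the degrees of freedom used to find the p-value.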

There are a wide variety of statistical tests for different hypotheses. Some commonly used ones include:

- **One-sample t-tests**, which test whether a population mean is significantly different from a given value (for example, whether the mean is significantly different from 0). The scipy function for this is `stats.ttest_1samp`.
- **Paired two-sample t-tests**, which are like two-sample t-tests but are used when the two samples consist of paired measurements, such as repeated measurements on the same individuals. The scipy function for this is `stats.ttest_rel`.
- **Chi-squared tests**, which test whether **categorical data** (data that is sorted into specific categories) have frequencies/counts consistent with an expected distribution (for example, testing whether a die is biased by counting the number of rolls of each number in a given sample and comparing to the expected distribution of counts). The scipy function for this goodness-of-fit version is `stats.chisquare`.
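
A brief sketch of how these three tests might be called is given below. All of the numbers are made up for illustration; `stats.chisquare` is scipy's goodness-of-fit chi-squared test, which by default compares the observed counts against equal expected counts.

```python
import numpy as np
from scipy import stats

# One-sample t-test: is the mean of this made-up sample different from 5.0?
sample = np.array([5.1, 4.9, 5.6, 5.8, 6.0, 5.2, 4.7, 5.5])
one_sample = stats.ttest_1samp(sample, popmean=5.0)

# Paired t-test: made-up measurements on the same five individuals, before and after
before = np.array([10.0, 12.0, 9.0, 11.0, 13.0])
after = np.array([12.0, 14.0, 10.0, 13.0, 15.0])
paired = stats.ttest_rel(before, after)

# Chi-squared test: made-up counts of each face in 100 rolls of a die
observed_rolls = np.array([16, 18, 16, 14, 12, 24])
die_test = stats.chisquare(observed_rolls)

# Counts that match the expectation exactly give a statistic of 0
fair = stats.chisquare([10, 10, 10, 10, 10, 10])

for name, result in [("one-sample", one_sample), ("paired", paired), ("die", die_test)]:
    print("{0}: statistic = {1:.3f}, p = {2:.3f}".format(name, result.statistic, result.pvalue))
```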

## Correlation and Regression

As we discussed in lecture, **correlation** is a measure of how linearly related two variables are. This is usually expressed by a value called the **correlation coefficient**. The most commonly used such value is **Pearson's correlation coefficient**, represented with an $r$, which ranges from -1 to 1 with values closer to $\pm 1$ indicating a stronger correlation and the sign indicating the direction of the correlation. 

Using our Tooth Growth dataset, we might want to find, for example, the relationship between dose and length for the `VC`-dosed group. We can do this with the `stats.pearsonr()` function, which as the name implies calculates Pearson's r:

In [None]:
# Make sure you've imported scipy.stats as stats
# First, let's subset our data to just the VC group
vc_growth = tooth_growth[tooth_growth['supp']=='VC']

# Next, we can use the pearsonr function:
pearson = stats.pearsonr(vc_growth['len'], vc_growth['dose'])
pearson

This function returns a pair of values (which can be unpacked like a *tuple*), with the first value corresponding to the r-value and the second value corresponding to a two-tailed p-value. Since our r-value is 0.899, we conclude that there is a strong positive relationship between dose and tooth length, i.e. when dose increases, tooth length tends to increase as well. Note that the order of the arguments we provide to the `pearsonr` function does not matter:

In [None]:
pearson_switched = stats.pearsonr(vc_growth['dose'], vc_growth['len'])
pearson_switched

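As we can see, there is no difference in the final r value; this is true in general of Pearson's correlation, since switching which variable is considered x and which is considered y does not change how strongly the two are related. (The same is not true of the regression line discussed next, whose slope and intercept depend on which variable is treated as the dependent one.)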
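
To see why the order does not matter, it can help to compute Pearson's r by hand: it is the sum of the products of the centered values, divided by the square root of the product of the two sums of squared centered values. The sketch below uses made-up dose and length values, and `pearson_r` is a helper written for this example rather than a scipy function.

```python
import numpy as np
from scipy import stats

dose = np.array([0.5, 0.5, 1.0, 1.0, 2.0, 2.0])         # made-up doses
length = np.array([7.0, 9.5, 16.0, 17.5, 25.0, 27.5])   # made-up tooth lengths

def pearson_r(a, b):
    """Pearson's r computed directly from its definition."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    numerator = (a_centered * b_centered).sum()
    denominator = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    return numerator / denominator

r_manual = pearson_r(dose, length)
r_swapped = pearson_r(length, dose)             # swapping the arguments
r_scipy, p_value = stats.pearsonr(dose, length)

print(r_manual, r_swapped, r_scipy)
```

Swapping `a` and `b` only changes the order of the multiplications, so the result is unchanged.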

**Regression analysis**, broadly speaking, is the estimation of the relationship between two or more variables. **Linear regression** specifically seeks to find the line of best-fit between two variables. This means that we are fitting a relationship in the form $y = a + bx$, where $x$ is the independent variable and $y$ is the dependent variable. In our case, $x$ would be the dosage, and $y$ would be the resultant tooth length. 

To calculate the line of best-fit in Python, we can use the `scikit-learn` package (imported as `sklearn`). This is a separate package built on top of Numpy and Scipy, and contains many models for machine learning applications. It also includes regression models, which we can use for our calculation. Using the same dataframe we just created, we can fit the model as follows:

In [None]:
# We do not need to import all of scikit-learn
from sklearn.linear_model import LinearRegression

# Create our x and y variables
x = vc_growth['dose'].values.reshape((-1,1))
# We need to reshape x since sklearn requires it to be two-dimensional
y = vc_growth['len'].values

# Create a model variable, and then fit it to the data
model = LinearRegression().fit(x,y)

# Get our results from the model
slope = model.coef_ # This is an array
intercept = model.intercept_
r_sq = model.score(x,y)

print("R^2 value: {0}, formula: y = {1} + {2}x".format(r_sq,intercept,slope[0]))

The "score" calculated in the final step is a value known as the **coefficient of determination**, $R^2$, which measures the proportion of the variance in the dependent variable that is explained by the independent variable. In the case of simple linear regression, it is exactly equal to $r^2$ where $r$ is Pearson's r. In this example, since the $R^2$ is 0.808 we conclude that dose explains about 81% of the variance in the tooth lengths. We can also see that the model returns a slope and intercept, which can be used to find the best-fit equation.
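
The claim that $R^2$ equals $r^2$ for simple linear regression can be checked numerically. The sketch below is an alternative route that uses numpy's `np.polyfit` for the least-squares line instead of scikit-learn, with made-up data; the variable names are chosen for this example only.

```python
import numpy as np
from scipy import stats

dose_values = np.array([0.5, 0.5, 1.0, 1.0, 2.0, 2.0])         # made-up doses
length_values = np.array([7.0, 9.5, 16.0, 17.5, 25.0, 27.5])   # made-up tooth lengths

# Fit a degree-1 polynomial (a line); coefficients come back highest power first
slope, intercept = np.polyfit(dose_values, length_values, 1)
predicted = intercept + slope * dose_values

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = ((length_values - predicted) ** 2).sum()
ss_tot = ((length_values - length_values.mean()) ** 2).sum()
r_squared = 1 - ss_res / ss_tot

r, p_value = stats.pearsonr(dose_values, length_values)

print("R^2: {0}, r^2: {1}".format(r_squared, r ** 2))
print("formula: y = {0} + {1}x".format(intercept, slope))
```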

We can generate a plot of this best-fit line alongside the initial data using the `matplotlib` package, which contains useful functions for plotting. This is done as follows:

In [None]:
import matplotlib.pyplot as plt
# pyplot is a convenient way to use matplotlib functions

plot = plt.scatter(x, y, color = 'b') # Create a scatterplot with the initial data
plot = plt.plot(x, model.predict(x), color = 'k') # Plot the best-fit line
# model.predict gives the predicted y-values using the best-fit line
plt.show()

Something we might notice is that since the dosage data only takes 3 values, linear regression may not be the best way to model this data. In general, you would create a linear model from a set of data if the way you have collected the data supports that sort of model.

# An Example Biological Dataset #

Proteins, the functional molecules of all cells, are often complexes of multiple protein chains. Each protein chain has distinct N-terminus and C-terminus ends. For example, the protein of human hemoglobin A2 [1SI4](https://www.rcsb.org/structure/1SI4) has 4 chains forming one protein. Each chain is colored differently as shown below:

![1SI4 structure](1SI4.png  "Structure of human hemoglobin A2 (PDB ID 1SI4)")

We're curious how many amino acids are needed for a typical protein chain. The sizes of proteins can give hints to their biochemical structure and their biological function. The distribution of protein lengths can also give hints about protein evolution.

We have access to the chain lengths of all entries in the RCSB PDB.

In [None]:
# We import the data using the pandas function `read_csv`.
# The function returns a pandas dataframe
df = pd.read_csv('./chain_lengths.tsv', sep='\\t', engine='python')

# Let's look at the first few lines of the dataframe
df.head()

If you're interested, you can search up any of these proteins in the RCSB PDB database and observe their structure.

For now, we're just curious about the chain lengths and can ignore the first two columns of the data. We can extract just the lengths of all the chains and store them in an array called `lengths`.

In [None]:
lengths = df.loc[df['length'] >= 0, 'length'].values
len(lengths)

If you ran the above cell, you can see that that's a lot of protein chains! The PDB database gets a lot of submissions. Over the decades, it has accumulated hundreds of thousands of resolved protein structures. Some of these structures actually come from different research groups resolving the structure of the same protein. 

We'll be using this `chain_lengths` dataset in the last exercise of this module.

# Exercises

Complete the following exercises in code and markdown chunks below this one.

### Exercise 1 ### 
Create a function that calculates the standard deviation of a numpy array without using the `np.std` or `np.var` functions, and test it on some inputs (hint: you may want to look up some of the mathematical numpy methods). We've included some starter code for you. Replace the ellipsis with your code.

In [None]:
def standard_deviation(arr):
    ...

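# small test for your function
# (np.std divides by n by default, so your function should too to pass this check)
a = np.array([1, 2, 3, 4, 5, 6])
# compare floats with np.isclose rather than ==, since rounding can differ slightly
assert np.isclose(standard_deviation(a), np.std(a))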

# write some more tests here! you don't need to use the `assert` keyword. you can just print the outputs.

### Exercise 2 ### 
Using either of the two datasets given (`ToothGrowth` or `ChickWeight`), describe a hypothesis you might want to test, write code to test it, and describe your results. Include an explanation of assumptions made. You are not limited to the tests we have described here.

### Exercise 3 ###
Using either of the two datasets given (`ToothGrowth` or `ChickWeight`), describe a correlation you may want to find, write code to calculate this correlation, create a plot to display your results, and then describe your results.

### Exercise 4 ###
One of the most widely used data visualizations is the **histogram**. A histogram shows the distribution of values of a random variable. Here, we see a simple histogram of a sample dataset.

In [None]:
# Fixing random state for reproducibility
np.random.seed(19680801)

a, b = 100, 15
x = a + b * np.random.randn(1000)

plt.figure(figsize=(15, 10))

# the histogram of the data
n, bins, patches = plt.hist(x, bins=10, density=True)

plt.xlabel('Number')
plt.ylabel('Probability density')
plt.title('Histogram of Random Numbers Generated Using np.random.randn')
plt.xlim(40, 160)
plt.ylim(0, 0.03)
plt.grid(True)
plt.show()

Notice the parameters we passed into the `plt.hist` function. `x` is our data in the form of an array, and `bins` is the number of bins we want our data to fall into (i.e. we'll see 10 bars in the graph). `density` is a boolean value representing whether or not we want the histogram normalized into a probability density, so that the area of each bar (its height times its bin width) represents the proportion of the data falling in that bin. You can tell when a graph has `density=True` because the total area of the bars will be 1; the bar heights themselves generally do not sum to 1.
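
The effect of `density=True` can be checked numerically with `np.histogram`, which computes bar heights and bin edges without drawing anything. This is a sketch on freshly generated random numbers, using numpy's generator interface with an arbitrary seed rather than the exact draws plotted above.

```python
import numpy as np

rng = np.random.default_rng(19680801)  # arbitrary seed for reproducibility
samples = 100 + 15 * rng.standard_normal(1000)

# Bar heights and bin edges for a 10-bin density histogram
heights, edges = np.histogram(samples, bins=10, density=True)

widths = np.diff(edges)                  # width of each bin
total_area = (heights * widths).sum()    # sum of height x width over all bars
sum_heights = heights.sum()              # generally NOT equal to 1

print("Total area: {0}".format(total_area))
print("Sum of heights: {0}".format(sum_heights))
```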

Now, we want to graph the distribution of protein chain lengths on a histogram. 

Replace the ellipsis with a call to `plt.hist` with the appropriate parameters.

Hint: Use the `lengths` variable, and play around with the `bins` parameter. You should get something that looks like this image, but it's okay if it doesn't look exactly like it!

![chain lengths histogram](histogram.png  "Protein Chain Lengths Histogram")

In [None]:
plt.figure(figsize=(15, 10))

# replace the ellipsis with a call to plt.hist
n, bins, patches = ...

plt.xlabel('Protein Chain Lengths')
plt.ylabel('Probability')
plt.title('Histogram of Protein Chain Lengths from the RCSB PDB')
plt.xlim(0, 2100)
plt.ylim(0, 0.003)
plt.grid(True)
plt.show()

Wow, nice histogram!

### Open-ended questions: ###
Please answer in a Markdown cell below.
1. How would you describe the distribution of protein chain lengths? Does it look like a normal distribution? Does it have a skew?
2. What might be some issues with the dataset we chose to work with? (Hints: If we want to conduct evolutionary studies, are we equally representing all organisms? Does the PDB database tend to have more proteins from one species over another?)


## Congrats on finishing Module 4! You're a rockstar. ##

Optional: If you want to play with the protein chain lengths dataframe (`df`) some more, here are some questions that might be fun to explore. You will need to search up pandas functions to answer these questions.
1. Which protein (i.e. which four letter code) has the longest chain?
2. What protein(s) have the shortest chain? Does that chain have any secondary structure?
3. What's the average chain length?
4. Challenge: Do most PDB IDs (the first column of the dataframe) have 1 or 2 chains?