# W200 Python Fundamentals for Data Science, UC Berkeley MIDS
# Final Exam


## Instructions
The final exam is designed to evaluate your grasp of Python theory as well as Python coding.

- This is an individual exam.
- You have 24 hours to complete the exam, starting from the point at which you first access it.
- You will be graded on the quality of your answers.  Use clear, persuasive arguments based on concepts we covered in class.
- While we've left one code/markdown cell for you after each question as a placeholder, some of your answers will require multiple cells to fully respond.
- Double click the markdown cells where it says YOUR ANSWER HERE to enter your written answers; if you need more cells for your written answers, please make them markdown cells (rather than code cells)

## Ramiro Cadavid

## 1: General Questions (21 pts)

a) The following method is part of a larger program used by a mobile phone company.  It will work when an object of type MobileDevice or of type ServiceContract is passed in.  This is a demonstration of (select all that apply and state a reason why it applies):

    1. Inheritance
    2. Polymorphism (X)
    3. Duck typing (X)
    4. Top-down design 
    5. Functional programming

In [5]:
# Method:

def add_to_cart(item):
    cart.append(item)
    total += item.price

- a)
    - It is a demonstration of **duck typing** because `add_to_cart` never checks the type of the item it receives. It works with a MobileDevice, a ServiceContract, or any other object, as long as that object has a `price` attribute.
    - It is a demonstration of **polymorphism** because MobileDevice and ServiceContract both expose `price`, which gives them a common interface. The same call, `add_to_cart(item)`, handles either type, even if the two classes store or compute their price differently.
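
A minimal sketch of this behaviour, assuming hypothetical `MobileDevice` and `ServiceContract` classes that share nothing except a `price` attribute. The class bodies and the prices are made up for illustration, and the `global` statement is an assumption about how `total` is shared with the rest of the program:

```python
class MobileDevice:
    """Hypothetical product class; only `price` matters to the cart."""
    def __init__(self, price):
        self.price = price


class ServiceContract:
    """Hypothetical, unrelated class that also exposes `price`."""
    def __init__(self, price):
        self.price = price


cart = []
total = 0


def add_to_cart(item):
    # `global` is needed here because `total` is rebound inside the function.
    global total
    cart.append(item)      # no isinstance() check: duck typing
    total += item.price    # works for any object with a `price` attribute


add_to_cart(MobileDevice(price=500))
add_to_cart(ServiceContract(price=40))
print(total)  # 540
```

Neither class inherits from the other or from a shared parent, which is why inheritance is not among the selected options.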

b) Suppose you have a long list of digits (0-9) that you want to write to a file.  Would it be more efficient to use ASCII or UTF-8 as an encoding?  How could you create an even smaller binary file to store the information?

- b) 
    - Neither is more efficient here: the two encodings produce identical files. ASCII defines 7-bit codes, but a file stores each character in a full byte (8 bits). UTF-8 is backward compatible with ASCII and encodes every ASCII character, including the digits 0 to 9, as the same single byte. A list of $n$ digits therefore takes $n$ bytes in either encoding.

    - An even smaller binary file could be created by encoding each digit in 4 bits, from decimal 0 as binary 0000 to decimal 9 as binary 1001, and packing two digits into each byte. This uses 4 bits fewer per digit than ASCII or UTF-8 and halves the file size.
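
The byte counts can be checked directly. The sketch below encodes the same digits both ways and then packs two digits per byte. The helper names `pack_digits` and `unpack_digits` are hypothetical, and padding an odd-length input with a trailing zero is just one possible convention:

```python
digits = "0123456789" * 100  # 1000 digits

ascii_bytes = digits.encode("ascii")
utf8_bytes = digits.encode("utf-8")


def pack_digits(text):
    """Pack two decimal digits into each byte (4 bits per digit)."""
    if len(text) % 2:
        text += "0"  # pad; a real format would also store the true length
    packed = bytearray()
    for i in range(0, len(text), 2):
        high, low = int(text[i]), int(text[i + 1])
        packed.append((high << 4) | low)
    return bytes(packed)


def unpack_digits(data, length):
    """Reverse pack_digits, dropping any padding digit."""
    chars = []
    for byte in data:
        chars.append(str(byte >> 4))
        chars.append(str(byte & 0x0F))
    return "".join(chars)[:length]


packed = pack_digits(digits)
print(len(ascii_bytes), len(utf8_bytes), len(packed))  # 1000 1000 500
```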

c) You are part of a team working on a spreadsheet program that is written in Python 3.  The program includes several classes to represent different types of objects that fit into a cell of a spreadsheet.  Give a strong argument for why your team should write an abstract base class to represent such objects and give examples of what should go into such an abstract base class.

- c)

Different data types can share many attributes and methods, regardless of what each type encodes. For example, suppose we have two data types, profession and ethnicity. Instances of both classes would need to be initialized with a value and would share the validations run at initialization. They would also share attributes describing the values they can store (e.g. allowed characters, maximum length) and methods for transforming those values (for example, splitting a multi-select response into multiple values, removing spaces between words, or capitalizing letters). This holds even when the data types differ significantly. For example, every class may validate that its value contains only ASCII characters, or that it is no longer than a certain number of characters.

Therefore, it makes more sense to define these common attributes and methods once in a base class and let child classes inherit them than to recreate them in each class. First, this avoids duplicated code. Second, it makes the code more readable and less prone to errors. Third, it makes the code easier to maintain, since a change to a common attribute or method is made once in the base class instead of in every child class. Making the base class abstract also lets it declare methods that every child class must implement, and prevents the base class from being instantiated on its own.
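
As an illustration, here is a minimal sketch of such a base class using Python's `abc` module. The names `CellValue`, `TextCell`, `NumberCell`, the `display` method, and the 255-character limit are all hypothetical choices, not part of any real spreadsheet program:

```python
from abc import ABC, abstractmethod


class CellValue(ABC):
    """Hypothetical base class for anything stored in a spreadsheet cell."""

    MAX_LENGTH = 255  # assumed limit, for illustration only

    def __init__(self, raw):
        self.validate(raw)  # shared validation runs for every subclass
        self.raw = raw

    def validate(self, raw):
        if len(str(raw)) > self.MAX_LENGTH:
            raise ValueError("value too long for a cell")

    @abstractmethod
    def display(self):
        """Return the text shown in the cell; each subclass must define it."""


class TextCell(CellValue):
    def display(self):
        return str(self.raw).strip()


class NumberCell(CellValue):
    def display(self):
        return f"{float(self.raw):.2f}"


print(TextCell("  hi ").display())  # hi
print(NumberCell(3).display())      # 3.00
```

The length check lives in one place, while each child class supplies only what differs. Calling `CellValue("x")` directly raises a `TypeError`, because the abstract `display` method has no implementation.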

d) Explain why NumPy is better than lists for "vectorized" math operations. Give an example of an operation that is either impossible or painful to implement using traditional Python lists compared to NumPy arrays.

- d) 
    - First, because it is more **efficient**. NumPy applies an operation to a whole multidimensional array at once, running the loop in optimized, compiled code over homogeneously typed memory instead of in the Python interpreter. Linear algebra routines such as matrix multiplication are additionally delegated to BLAS (Basic Linear Algebra Subprograms). This makes these operations generally much faster than looping over lists.
    
    - Second, because it is more **concise** (i.e. it generally requires less code to perform the same operation). For example, replacing the values of an array that are greater than the mean but less than the mean plus one standard deviation takes one line with NumPy:
    
    `a[(a > a.mean()) & (a < a.mean() + a.std())] = X`
    
    By contrast (and even more so if the data has more than one dimension), the same operation with a list requires several lines of code and at least one loop for every dimension, which makes the code harder to write, more error-prone, and likely slower.
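
A small sketch of the comparison on a made-up 2 x 3 array, using `0.0` as an assumed replacement value in place of `X`:

```python
import numpy as np

a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
mean, std = a.mean(), a.std()

# Vectorized: one boolean mask covers every dimension.
vectorized = a.copy()
vectorized[(vectorized > mean) & (vectorized < mean + std)] = 0.0

# Plain lists: one explicit loop (here, a comprehension) per dimension.
rows = a.tolist()
looped = [[0.0 if mean < x < mean + std else x for x in row] for row in rows]

print(vectorized.tolist())  # [[1.0, 2.0, 3.0], [0.0, 0.0, 6.0]]
```

Here the mean is 3.5 and the standard deviation is about 1.71, so only 4 and 5 fall inside the interval and get replaced. The list version needs one more nested loop for every additional dimension, while the NumPy expression stays the same.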

e) We want a list of the squares of the nonnegative integers less than 10, keeping only the squares that are greater than 10.  The list comprehension below gives an empty list.  Correct it so that we get the desired output: [16, 25, 36, 49, 64, 81].

In [8]:
[x**2 for x in range(10) if x ** 2 > 10]

[16, 25, 36, 49, 64, 81]

f) Explain why the following code prints what it does.

In [9]:
def f(): pass
print(type(f))

<class 'function'>


- f) This code defines a function object named `f` and then prints the type of `f` itself. Because `f` is not called, the object passed to `type` is the function, whose class is `function`. This is why the output is `<class 'function'>`.

g) Explain why the following code prints something different.

In [8]:
def f(): pass
print(type(f()))

<class 'NoneType'>


- g) This code defines a function named `f` and then calls it, so `type` receives the value returned by `f()` rather than the function object. Because `f` has no `return` statement, the call returns `None`, whose type is `NoneType`. This is why the two snippets print different things: the first prints the type of the object `f`, while the second prints the type of the object returned by `f`.

## 2: Data Integrity (25 pts)

a) Why is it important to sanity-check your data before you begin your analysis? What could happen if you don't?

- a) 

Any errors and inconsistencies in the raw data will be carried over to the analysis. If the data is not checked before the analysis starts, two things can happen. In the best case, the errors are detected later, but fixing them then requires a lot of reprocessing, which makes the process inefficient and prone to additional errors. In the worst case, the errors go unnoticed and compromise the accuracy of the results. This, in turn, can have serious consequences, depending on the decisions made using the data, and can reduce the trust that data users have in the results produced by a data science team.

Therefore, sanity-checking the data is important because it generally makes the whole data management process faster by avoiding reprocessing, and because it increases the accuracy of the results by ensuring that the input data meets minimum standards of quality, adjusted to the importance and consequences of the decisions taken with this data.
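
As a rough illustration of what a first-pass sanity check might look like in pandas, the sketch below counts missing values, duplicate identifiers, and out-of-range values. The data frame, its column names, and the 0 to 120 age range are all made up for the example:

```python
import pandas as pd

# Hypothetical survey extract; the columns and values are invented.
df = pd.DataFrame({
    "respondent_id": [1, 2, 2, 4],
    "age": [34, -1, 27, 212],
    "state": ["FL", "NY", None, "TX"],
})

checks = {
    "missing_values": int(df.isna().sum().sum()),
    "duplicate_ids": int(df["respondent_id"].duplicated().sum()),
    "age_out_of_range": int((~df["age"].between(0, 120)).sum()),
}
print(checks)
# {'missing_values': 1, 'duplicate_ids': 1, 'age_out_of_range': 2}
```

Each nonzero count points at rows that would otherwise flow silently into the analysis.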


b) Explain, in your own words, why real-world data is often messy.

- b)

Real-world data is messy for several reasons. First, some quantities are hard to measure, whether by human judgement or by instruments, because the required precision and accuracy may not be technically feasible. Second, variables can be poorly defined, and quality control methods (for example, data validation, or digital data collection tools that check the input and prompt users to correct errors) are often missing or poorly implemented. Third, the structures that store the data can be badly designed, for example by encoding data in the names of variables (such as age18-24, age25-32, ...). Finally, the quality assurance systems that assess the captured data and make sure it meets minimum standards of quality may be poorly implemented.

c) How do you determine which variables in your dataset you should check for issues prior to starting an analysis? 

- c)

To decide which variables to check for issues, I would use the following criteria:

- Likelihood of error: based on my knowledge of how the variable is captured, how hard is it to measure correctly, and how accurate are the instruments used to measure it? This gives a sense of how far the variable can be trusted to be close to the actual value of the dimension being measured.
- Margin of error: given how the variable is captured and how likely it is to be inaccurate, how large is the likely deviation from the actual value? Is this deviation large enough that it needs to be dealt with directly?
- Weight in the analysis: what decisions will be taken with the outputs generated from this variable, and what would be the impact of making a wrong call on them? The higher the weight in the analysis, the higher the priority the variable has in the data checking activities.

d) How do you know when you have adequately checked these variables?

- d)

The thoroughness with which these variables need to be checked is mainly a function of the third criterion mentioned above: the more important the decision to be taken and the higher the cost of a wrong call, the higher the standard for deciding that the variables have been adequately checked. However, the other two criteria also weigh in: variables measured with a good track record of accuracy and precision don't need to be checked as thoroughly or as often as those with a poor track record.

e) Is it possible to fully vet your data for errors before you begin your analysis? If not, what should you be looking out for while you complete your analysis?

- e)

It is not, whether because of resource constraints, because the tools and additional data required for a more in-depth validation are lacking, or because some errors can only be detected (or are more easily detected) during the analysis of the data. For this reason, as the analysis progresses, one should keep validating the results against previous results or external sources. One should also sense-check different results against each other and against the available theory, if possible. For example, if I am analyzing economic variables for a specific country and find that, in the same period, inflation and unemployment are both high while growth has slowed (a situation known in economic theory as stagflation), this is a sign that I should check my data again, since this is a rare event.

## 3:  Elections (24 pts)

Consider the following data frame in Pandas.

In [11]:
import pandas

# creating a data frame from scratch - list of lists

data = [ ['marco', 165, 'blue', 'FL'], 
         ['jeb', 0, 'red', 'FL'], 
         ['chris', 0, 'white', 'NJ'], 
         ['donald', 1543, 'white', 'NY'],
         ['ted', 559, 'blue', 'TX'],
         ['john', 161, 'red', 'OH']
       ]

# create a data frame with column names - list of lists

col_names = ['name', 'delegates', 'color', 'state']
df = pandas.DataFrame(data, columns=col_names)
df

|   |name  |delegates|color|state|
|---|------|---------|-----|-----|
|0  |marco |165      |blue |FL   |
|1  |jeb   |0        |red  |FL   |
|2  |chris |0        |white|NJ   |
|3  |donald|1543     |white|NY   |
|4  |ted   |559      |blue |TX   |
|5  |john  |161      |red  |OH   |


a) Using bracket indexing in Pandas, show how many delegates `ted` got.

In [12]:
df[df.name == 'ted']['delegates']

4    559
Name: delegates, dtype: int64

b) Using bracket indexing in Pandas, show how many total delegates were obtained by candidates whose favorite color is blue.

In [13]:
df[df.color == 'blue']['delegates'].sum()

724

c) Using groupby and aggregate in Pandas, show how many total delegates were obtained by candidates grouped by favorite color.

In [17]:
# Select the column first so only 'delegates' is aggregated
df.groupby('color')['delegates'].aggregate('sum')

color
blue      724
red       161
white    1543
Name: delegates, dtype: int64

## 4: Clinical disease data (30 pts)

Your boss comes to you Monday morning and says “I figured out our next step; we are going to pivot from an online craft store and become a data center for genetic disease information! I found **ClinVar** which is a repository that contains expert curated data, and it is free for the taking. This is a gold mine! Take a week and tell me what gene and mutation combinations are classified as dangerous.”

1)  Look at the sample data set (in the Sample ClinVar data below or in the .txt file) and develop a plan of action to use Python to extract and summarize just what your boss wants. **Don’t code**. You can use pseudocode and/or an essay format to generate a plan in 500 words or less. 

2) Tell us the output that you expect from your planned code

**Hints:**  

* Look at the sample file carefully. What fields do you want to extract? Are they in the same place every time? What strategy will you use to robustly extract and filter your data of interest? How do you plan to handle missing data?

* Filter out junk. Just focus on what your boss asked for (1) gene name (2) mutation reference. (3) Filter your data to include only mutations that are dangerous as you define it. 

* Pandas and NumPy parsers correctly recognize the end of each line in the ClinVar file.

* The unit of observation of this dataset is one row per mutation.

* While you shouldn't code your analysis, creating a few lines of code while you think through the problem may be helpful (so that you can sanity check that your plan works). So you can experiment, we have included the data file below as a Tab Separated Value file "Genomics_Questions.txt". Please do not submit any such code. For example, if I wanted to check that I accurately understand the "split" function in the context of this data, I could type:

```python
sample = "abc;def;asd"
test = sample.split(';')
```

**This is a planning question: we want you to lay out a plan in text, not code.** 

### VCF file description (Summarized from version 4.1)


* The VCF specification:

VCF is a text file format which contains meta-information lines, a header
line, and then data lines each containing information about a position in the genome. The format also has the ability to contain genotype information on samples for each position.

* Fixed fields:

There are 8 fixed fields per record. All data lines are **tab-delimited**. In all cases, missing values are specified with a dot (‘.’). 

1. CHROM - chromosome number
2. POS - position, as a DNA nucleotide count (bases) along the chromosome
3. ID - The unique identifier for each mutation
4. REF - reference base(s) alleles
5. ALT - alternate base(s) alleles
6. QUAL - Phred scaled quality score
7. FILTER - filter status (if position has passed all filters)
8. INFO - a semicolon-separated series of  keys with values in the format: <key>=<data>, and specified as <key>=<data name>[data value definition].


### INFO field specifications

```
GENEINFO = <Gene symbol>
CLNSIG =  <Variant Clinical Significance (Severity)>
  0 – unknown	(Uncertain significance)
  1 – untested	(not provided)
  2 - non-pathogenic	(Benign)
  3 - probable-non-pathogenic	(Likely benign)
  4 - probable-pathogenic	(Likely pathogenic)
  5 – pathogenic	(Pathogenic)
  6 - drug-response	(drug response)
  7 – histocompatibility	(histocompatibility)
  255 - other	(other)
```

### Representative/Sample ClinVar data (vcf file format)

```
##fileformat=VCFv4.0							
##fileDate=20160705							
##source=ClinVar and dbSNP							
##dbSNP_BUILD_ID=147							
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
1	949523	rs786201005	C	T	.	.	GENEINFO=ISG15;CLNSIG=5
1	949696	rs672601345	C	CG	.	.	GENEINFO=ISG15;CLNSIG=5;CLNDBN=Cancer
1	949739	rs672601312	G	T	.	.	GENEINFO=ISG15;CLNDBN=Cancer
1	955597	rs115173026	G	T	.	.	GENEINFO=AGRN;CLNSIG=2; CLNDBN=Cancer
1	955619	rs201073369	G	C	.	.	GENEINFO=AGG;CLNDBN=Heart_dis 
1	957640	rs6657048	C	T	.	.	GENEINFO=AGG;CLNSIG=3;CLNDBN=Heart_dis 
1	976059	rs544749044	C	T	.	.	GENEINFO=AGG;CLNSIG=0;CLNDBN=Heart_dis 
```

A second version of this file is provided as a .txt file in case you want to load it into your console to test it out. You can use either file for the data modeling.

#### Request interpretation
"Take a week and tell me what gene and mutation combinations are classified as dangerous." I interpret this request as asking for every unique combination of gene (stored as the GENEINFO=X substring of the INFO column) and mutation (stored in the ID column) that is dangerous. I define a combination as dangerous if it meets either of two conditions: (1) its clinical significance is 'probable-pathogenic' or 'pathogenic' (stored as 4 or 5 in the CLNSIG=Y substring of the INFO column), or (2) it is associated with cancer or heart disease (stored as Cancer or Heart_dis in the CLNDBN=Z substring of the INFO column).

#### Preparation
1. Load data from the text file into a data frame as a csv, discarding the first four lines, using the fifth line as the column header and using tab as the separator between columns.
1. Remove the '#' character from the name of the first column.
1. Transform column names to lowercase.
1. Remove all unused columns: 'chrom', 'pos', 'alt', 'qual' and 'filter'.
1. Split column 'info' into three columns: 'geneinfo', 'clnsig' and 'clndbn'. The first will contain the value that follows 'GENEINFO=', the second the value that follows 'CLNSIG=', and the third the value that follows 'CLNDBN=', each read up to the next ';' or the end of the string and stripped of surrounding whitespace. If any of these three keys is not present, store NaN in the corresponding column.
1. Remove rows with missing values in both 'clnsig' and 'clndbn'.
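
The plan itself is text, but step 5 is easy to sanity-check with a few throwaway lines of the kind the hints allow. The sketch below is one possible approach, not the only one: instead of searching for substrings, it splits the INFO string on ';' and then on '=', so the keys may appear in any order. The helper names `parse_info` and `extract` are hypothetical:

```python
def parse_info(info):
    """Split a VCF INFO string into a dict of key -> value."""
    fields = {}
    for part in info.split(";"):
        part = part.strip()  # tolerate stray spaces such as "; CLNDBN=..."
        if "=" in part:
            key, value = part.split("=", 1)
            fields[key.strip()] = value.strip()
    return fields


def extract(info, key):
    """Return the value for `key`, or None when it is absent, empty or '.'."""
    value = parse_info(info).get(key)
    return None if value in (None, "", ".") else value


# Two INFO strings taken from the sample data
print(parse_info("GENEINFO=AGRN;CLNSIG=2; CLNDBN=Cancer"))
# {'GENEINFO': 'AGRN', 'CLNSIG': '2', 'CLNDBN': 'Cancer'}
print(extract("GENEINFO=ISG15;CLNDBN=Cancer", "CLNSIG"))  # None
```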

#### Analysis
1. Keep only the rows where: (value in column 'clnsig' is equal to 4 or 5) or (value in column 'clndbn' is equal to Cancer or Heart_dis).
1. Create a column named 'gene-mutation' that concatenates the values of the 'geneinfo' and 'id' columns.
1. Generate a data frame called dangerous_gene_mutation that contains one row per unique value of the 'gene-mutation' column, keeping the 'geneinfo' and 'id' columns.
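
Again only as a sanity check, the analysis steps can be sketched in pandas. The `prepared` frame below is typed in by hand from the seven sample rows, as they might look after the preparation steps; keeping `clnsig` as strings is an assumption about how the parsed values would be stored:

```python
import pandas as pd

# Sample rows as they might look after preparation (values from the sample file).
prepared = pd.DataFrame({
    "id": ["rs786201005", "rs672601345", "rs672601312", "rs115173026",
           "rs201073369", "rs6657048", "rs544749044"],
    "geneinfo": ["ISG15", "ISG15", "ISG15", "AGRN", "AGG", "AGG", "AGG"],
    "clnsig": ["5", "5", None, "2", None, "3", "0"],
    "clndbn": [None, "Cancer", "Cancer", "Cancer",
               "Heart_dis", "Heart_dis", "Heart_dis"],
})

is_pathogenic = prepared["clnsig"].isin(["4", "5"])
is_listed_disease = prepared["clndbn"].isin(["Cancer", "Heart_dis"])

# Keep rows meeting either condition, then build the combined key.
dangerous = prepared[is_pathogenic | is_listed_disease].copy()
dangerous["gene-mutation"] = dangerous["geneinfo"] + "-" + dangerous["id"]

dangerous_gene_mutation = (
    dangerous.drop_duplicates(subset="gene-mutation").reset_index(drop=True)
)
print(len(dangerous_gene_mutation), int(is_pathogenic.sum()))  # 7 2
```

On this sample every row passes the filter, because each one either has a CLNSIG of 5 or names one of the two diseases. Only two rows pass on CLNSIG alone.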

#### Presentation (expected output)
1. Create a table with the dangerous_gene_mutation data frame, grouped by gene and mutation, and include the 'danger' classification level columns, i.e. clnsig and clndbn.
1. Export this table and share it with my boss. The table should look like this:

|Gene  |Mutation|CLNSIG|CLNDBN   |
|------|--------|------|---------|
|gene_a|mut_1   |4     |NaN      |
|      |mut_2   |4     |Heart_dis|
|      |mut_3   |1     |Cancer   | 
|gene_b|mut_4   |5     |Cancer   |
|      |mut_5   |4     |NaN      |
|gene_c|mut_6   |3     |Cancer   |
|      |mut_7   |5     |Heart_dis|
|      |mut_8   |4     |NaN      |