# Python Fundamentals for Data Science: Final Exam


## Instructions
The final exam is designed to evaluate your grasp of Python theory as well as Python coding.

- This is an individual exam.
- You have 24 hours to complete the exam, starting from the point at which you first access it.
- You will be graded on the quality of your answers.  Use clear, persuasive arguments based on concepts we covered in class.
- Please double-click the markdown cells where it says "Your answer here" to input answers (if you need more cells please make them markdown cells)

## YOUR NAME HERE: Youzhi Chloe Wu

## 1: General Questions (21 pts)

a) The following method is part of a larger program used by a mobile phone company.  It will work when an object of type MobileDevice or of type ServiceContract is passed in.  This is a demonstration of (select all that apply and state a reason why it applies):

    1. Inheritance
    2. Polymorphism
    3. Duck typing
    4. Top-down design
    5. Functional programming

In [5]:
# Method:

def add_to_cart(item):
    cart.append(item)
    total += item.price

- a) Your answer here

> This is a demonstration of 3. Duck typing. 
   - Duck typing: Python does not check an argument's type before running a function; it only requires that the object provide what the function uses, here a `price` attribute. That is why add_to_cart() works for both MobileDevice and ServiceContract objects. 
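
A minimal sketch of the duck typing described above. The two class bodies, the prices, and the module-level `cart` and `total` are assumptions made for illustration, since the exam does not show them. Note that `total += item.price` rebinds `total`, so if `total` is a module-level variable the function needs `global total` to run.

```python
class MobileDevice:
    """Hypothetical class: only the price attribute matters here."""
    def __init__(self, price):
        self.price = price


class ServiceContract:
    """Hypothetical class, unrelated to MobileDevice by inheritance."""
    def __init__(self, price):
        self.price = price


cart = []
total = 0


def add_to_cart(item):
    # Assumption: total lives at module level, so it must be declared global
    # before it can be reassigned inside the function.
    global total
    cart.append(item)
    total += item.price  # works for any object that has a .price attribute


add_to_cart(MobileDevice(499.0))
add_to_cart(ServiceContract(30.0))
print(total)
```

Neither class inherits from the other or from a shared base, yet both can be passed in because each has a `price` attribute.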

b) Suppose you have a long list of digits (0-9) that you want to write to a file.  From a storage standpoint, would it be more efficient to use ASCII or UTF-8 as an encoding? How could you create an even smaller file to store the information?

- b) Your answer here

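> For a file that contains only the digits 0-9, ASCII and UTF-8 produce files of the same size. ASCII is a 7-bit code, but in practice each character is stored in one 8-bit byte, and UTF-8 encodes every ASCII character as that same single byte. To create an even smaller file, we could store the digits in binary (one digit needs only 4 bits, so two digits fit in one byte) or apply a compression algorithm such as Huffman coding. 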
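
The storage comparison can be checked directly. This is a sketch only: the digit string is made up, and `pack_digits` is a hypothetical helper that stores two digits per byte, which is one of several ways to shrink the file.

```python
digits = "31415926535897932384"  # made-up sample of 20 digits

ascii_size = len(digits.encode("ascii"))
utf8_size = len(digits.encode("utf-8"))


def pack_digits(text):
    """Pack two decimal digits into each byte (4 bits per digit)."""
    nibbles = [int(ch) for ch in text]
    if len(nibbles) % 2:
        nibbles.append(0xF)  # pad odd-length input with a non-digit marker
    return bytes(hi << 4 | lo for hi, lo in zip(nibbles[::2], nibbles[1::2]))


packed = pack_digits(digits)
print(ascii_size, utf8_size, len(packed))
```

Both text encodings use one byte per digit, while the packed form uses half as many bytes.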

c) You are part of a team working on a spreadsheet program that is written in Python 3.  The program includes several classes to represent different types of objects that fit into a cell of a spreadsheet.  Give a strong argument for why your team should write an abstract base class to represent such objects and give examples of what should go into such an abstract base class.

- c) Your answer here

> One reason is that an abstract base class holds all of the properties and methods that the cell types have in common, so every subclass inherits them automatically. It also defines the interface that every subclass must implement, so the rest of the program can treat any cell content the same way. Another benefit is maintainability: a change to shared behavior is made in one place instead of in every subclass. Because all of these object types must fit into a cell of a spreadsheet, they share enough behavior to justify a common base class. 

> The base class should include: an initializer of the base class, attributes / properties / methods that will be shared among sub-classes, such as fitting into a cell of a spreadsheet.  
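
A short sketch of what such a base class could look like, using the standard `abc` module. The class names `CellContent`, `NumberContent`, and `TextContent`, and the `display` method, are hypothetical and chosen only for illustration.

```python
from abc import ABC, abstractmethod


class CellContent(ABC):
    """Hypothetical base class for anything that can live in a cell."""

    def __init__(self, raw):
        self.raw = raw  # shared attribute: the value as entered

    @abstractmethod
    def display(self):
        """Return the string shown in the cell; each subclass must define it."""

    def width(self):
        # Shared behavior written once and inherited by every subclass.
        return len(self.display())


class NumberContent(CellContent):
    def display(self):
        return f"{float(self.raw):.2f}"


class TextContent(CellContent):
    def display(self):
        return str(self.raw)


cells = [NumberContent("3.14159"), TextContent("total")]
print([c.display() for c in cells])
```

Trying to instantiate `CellContent` directly raises `TypeError`, which prevents a cell type that lacks `display` from being created by mistake.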

d) Explain why NumPy is better than lists for "vectorized" math operations. Give an example of an operation that is either impossible or painful to implement using traditional Python lists compared to NumPy arrays.

- d) Your answer here

    -  NumPy provides an n-dimensional array object, a container for large data sets that allows fast reading and writing. Because NumPy arrays are homogeneous, operations on them can be sped up tremendously, whereas with a list the type of each item has to be checked as it is processed. 
    -  NumPy arrays also let you perform mathematical operations on whole blocks of data instead of looping through each element of a list. 
    -  Many mathematical operations on NumPy arrays match the behavior seen in linear algebra, whereas with lists you would need to write your own functions that loop through each element to achieve the same results. 
    -  Example: to change the shape of an n-by-n matrix, you can simply call the NumPy reshape() method, whereas achieving this with nested Python lists would be quite painful. 
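
The contrast can be sketched as follows. The numbers are made up for illustration; the NumPy calls used (`np.array`, `np.arange`, `reshape`) are standard.

```python
import numpy as np

prices = [1.5, 2.0, 3.25]
quantities = [2, 4, 1]

# With lists, elementwise math needs an explicit loop or comprehension.
list_totals = [p * q for p, q in zip(prices, quantities)]

# With arrays, the same operation applies to the whole block at once.
array_totals = np.array(prices) * np.array(quantities)

# Reshaping is a single call on an array.
matrix = np.arange(16).reshape(4, 4)
reshaped = matrix.reshape(2, 8)

print(list_totals, array_totals, reshaped.shape)
```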

e) We want a list of the squares of the nonnegative integers less than 10, keeping only the squares that are greater than 10.  Fill in a list comprehension below so that we get the desired output: [16, 25, 36, 49, 64, 81].

In [5]:
# 1e Your code here
print([x**2 for x in range(10) if x**2 > 10])

[16, 25, 36, 49, 64, 81]


f) Explain why the following code prints what it does.

In [7]:
def f(): pass
print(type(f))

<class 'function'>


- f) Your answer here

> f is defined as a function, although it does not do anything. That's why when printing the type of f (the variable name for this defined function), it is printed as a 'function'.

g) Explain why the following code prints something different.

In [8]:
def f(): pass
print(type(f()))

<class 'NoneType'>


- g) Your answer here

> The difference between g) and f) is that g) prints the type of the value returned by calling f(), not the type of f itself. The function body has no return statement, so the call returns None, and the type of None is NoneType. 

## 2: Data Integrity (25 pts)

a) Why is it important to sanity-check your data before you begin your analysis? What could happen if you don't?

- a) Your answer here

> There are several reasons why a sanity check before analysis is important:
- There may be mismatches in data type, variations in how values are entered, or missing values in the variables that you would like to analyze. Any of these could easily corrupt your analysis. 
- There could be duplicates or outliers that distort your analysis, although you need to be cautious when evaluating and manipulating them. 

> If you do not conduct a sanity check, you may run into errors during the analysis, or your results may be distorted by outliers, duplicates, or null values. 
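
A few of these checks can be sketched in pandas. The data frame below is made-up toy data, created only to show a missing value, a duplicate row, and inconsistent entry of the same value.

```python
import pandas

# Made-up toy data with deliberate problems.
people = pandas.DataFrame({
    "name": ["ann", "bob", "bob", "cy"],
    "age": [34, 29, 29, None],
    "state": ["FL", "fl", "fl", "NY"],
})

missing_per_column = people.isna().sum()       # count of nulls in each column
duplicate_rows = int(people.duplicated().sum())  # rows that repeat an earlier row
raw_states = people["state"].nunique()         # "FL" and "fl" count separately
clean_states = people["state"].str.upper().nunique()

print(missing_per_column["age"], duplicate_rows, raw_states, clean_states)
```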

b) Explain, in your own words, why real-world data is often messy.

- b) Your answer here

> Real-world data is usually entered by people or collected from human activity. Different people may interpret the same data field differently. Even when their understanding is consistent, their data entry habits will still vary unless rigid validation is enforced at the point of collection. In addition, there are many situations in which data cannot be collected, or is collected in the wrong format, because of unexpected circumstances. 

c) How do you determine which variables in your dataset you should check for issues prior to starting an analysis? 

- c) Your answer here

> First decide which variables you will use in your analysis. These include not only the variables presented directly in your report, but also any variables used to derive or filter them. Once you have that list, check every variable on it for issues. 

d) How do you know when you have adequately checked these variables?

- d) Your answer here

> In theory, the checking is never fully adequate. But after initial data cleansing, such as handling missing values, mismatched types, duplicates, and outliers that defy common sense, you can begin the analysis. Data checking is iterative and progressive. After some analysis, such as plotting distributions, ask whether the results make sense and, if not, why not. Asking these questions throughout the analysis may prompt more in-depth checking or, where checking is not possible, lead you to state reasonable assumptions. 

e) Is it possible to fully vet your data for errors before you begin your analysis? If not, what should you be looking out for while you complete your analysis?

- e) Your answer here

> It is most likely not possible to fully vet data for errors before analysis. Therefore, vetting data for errors should be an integral part of the analysis itself. Throughout your analysis, look for odd distributions, peculiar outliers, and graphs or results that defy common sense.

## 3:  Elections (24 pts)

Consider the following data frame in Pandas.

In [6]:
import pandas

# creating a data frame from scratch - list of lists

data = [ ['marco', 165, 'blue', 'FL'], 
         ['jeb', 0, 'red', 'FL'], 
         ['chris', 0, 'white', 'NJ'], 
         ['donald', 1543, 'white', 'NY'],
         ['ted', 559, 'blue', 'TX'],
         ['john', 161, 'red', 'OH']
       ]

# create a data frame with column names - list of lists

col_names = ['name', 'delegates', 'color', 'state']
df = pandas.DataFrame(data, columns=col_names)
df

|   | name | delegates | color | state |
|---|---|---|---|---|
| 0 | marco | 165 | blue | FL |
| 1 | jeb | 0 | red | FL |
| 2 | chris | 0 | white | NJ |
| 3 | donald | 1543 | white | NY |
| 4 | ted | 559 | blue | TX |
| 5 | john | 161 | red | OH |


a) Using bracket indexing in Pandas, show how many delegates `ted` got.

In [12]:
# 3a) Your code here
df[df.name == 'ted']['delegates']

4    559
Name: delegates, dtype: int64

b) Using filtering in Pandas, show how many total delegates were obtained by candidates whose favorite color is blue.

In [13]:
# 3b) Your code here
df[df.color == 'blue']['delegates'].sum()

724

c) Using groupby and aggregate in Pandas, show how many total delegates were obtained by candidates grouped by favorite color.

In [15]:
# 3c) Your code here
df.groupby('color').agg({'delegates':'sum'})

| color | delegates |
|---|---|
| blue | 724 |
| red | 161 |
| white | 1543 |


## 4: Clinical disease data (30 pts)

Your boss comes to you Monday morning and says “I figured out our next step; we are going to pivot from an online craft store and become a data center for genetic disease information! I found **ClinVar** which is a repository that contains expert curated data, and it is free for the taking. This is a gold mine! Take a week and tell me what gene and mutation combinations are classified as dangerous.”

1)  Look at the data set and develop a plan of action to use Python to extract and summarize just what your boss wants. **Don’t code**. You can use pseudocode and/or an essay format to generate a plan in 500 words or fewer. 

2) Tell us the output that you expect from your planned code

**Hints:**  

* Look at the below sample file carefully. What fields do you want to extract? Are they in the same place every time? What strategy will you use to robustly extract and filter your data of interest? How do you plan to handle missing data?

* Filter out junk. Just focus on what your boss asked for (1) gene name (2) mutation reference. (3) Filter your data to include only mutations that are dangerous as you define it. 

* Pandas and NumPy parsers correctly recognize the end of each line in the ClinVar file.

* The unit of observation of this dataset is one row per mutation.

* While you shouldn't code your analysis, creating a few lines of code while you think through the problem may be helpful (so that you can sanity check that your plan works). So you can experiment, we have included the data file below as a Tab Separated Value file "Genomics_Questions.txt". Please do not submit any such code. For example, if I wanted to check that I accurately understand the "split" function in the context of this data, I could type:

```python
sample = "abc;def;asd"
test = sample.split(';')
```
* This is similar to a task that one of us tackled at work. You can answer the question with the information provided below or you can look up things like `VCF file specification`. Our goal is to see that you can put together a sensible plan, describe the parsing strategy, and document and justify the decisions that you made.

**This is a planning question: we want you to lay out a plan in text, not code.** 

### VCF file description (Summarized from version 4.1)

```
* The VCF specification:

VCF is a text file format which contains meta-information lines, a header line, and then data lines each containing information about a position in the genome. The format also can contain genotype information on samples for each position.

* Fixed fields:

There are 4 fixed fields per record. All data lines are **tab-delimited**. In all cases, missing values are specified with a dot (‘.’). 

1. CHROM - chromosome number
2. POS - position DNA nucleotide count (bases) along the chromosome
3. ID - The unique identifier for each mutation
4. INFO - a semicolon-separated series of  keys with values in the format: <key>=<data>, and specified as <key>=<data name>[data value definition].

```
### INFO field specifications

```
GENEINFO = <Gene symbol>
CLNSIG =  <Variant Clinical Significance (Severity)>
  0 - unknown	(Uncertain significance)
  1 - untested	(not provided)
  2 - non-pathogenic	(Benign)
  3 - probable-non-pathogenic	(Likely benign)
  4 - probable-pathogenic	(Likely pathogenic)
  5 - pathogenic	(Pathogenic)
  6 - drug-response	(drug response)
  7 - histocompatibility	(histocompatibility)
  255 - other	(other)
CLNDBN = <Disease name>

```

### Representative/Sample ClinVar data (vcf file format)

```
##fileformat=VCFv4.0							
##fileDate=20160705							
##source=ClinVar and dbSNP							
##dbSNP_BUILD_ID=147							
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
1	949523	rs786201005	C	T	.	.	GENEINFO=ISG15;CLNSIG=5
1	949696	rs672601345	C	CG	.	.	GENEINFO=ISG15;CLNSIG=5;CLNDBN=Cancer
1	949739	rs672601312	G	T	.	.	GENEINFO=ISG15;CLNDBN=Cancer
1	955597	rs115173026	G	T	.	.	GENEINFO=AGRN;CLNSIG=2; CLNDBN=Cancer
1	955619	rs201073369	G	C	.	.	GENEINFO=AGG;CLNDBN=Heart_dis 
1	957640	rs6657048	C	T	.	.	GENEINFO=AGG;CLNSIG=3;CLNDBN=Heart_dis 
1	976059	rs544749044	C	T	.	.	GENEINFO=AGG;CLNSIG=0;CLNDBN=Heart_dis 
```

# Plan of Action for ClinVar 

## 1. Data Exploration
> ClinVar data is presented in VCF file format. The question this plan answers is "what gene and mutation combinations are classified as dangerous". Therefore, our data extraction and analysis will focus on the data fields that can answer this question. After examining the sample file, we would like to extract the following fields:
- __CHROM__: The chromosome number
- __ID__: The unique identifier for each mutation
- __POS__: position DNA nucleotide count (bases) along the chromosome. POS is needed because several mutations can share the same gene symbol (or name) while being different mutations at different positions. Together, CHROM and POS identify the position of a mutation unambiguously. 
- __INFO__: which includes (1) _GENEINFO_: Gene symbol; (2) _CLNSIG_: Clinical Significance (Severity); (3) _CLNDBN_: Disease name

> Next we define "dangerous mutation" and filter the data to keep only the records of interest. The VCF file specification provides the codes 0 to 7, as well as 255 (other), to describe clinical severity. However, we notice that some mutations are missing a CLNSIG value even though CLNDBN is provided. Therefore, we combine the two variables CLNSIG and CLNDBN to evaluate the danger level of a mutation. In our definition, a mutation is considered "dangerous" if it meets either of the following criteria:
1. Its CLNSIG value is 4 or 5;
2. It has a CLNDBN value, and its CLNSIG value is either missing or not one of 2, 3, 6, 7. 

> Based on this definition, the next section details how we will prepare, filter, and format the data for analysis. 


## 2. Data Preparation
> We will take the following steps to prepare our data for analysis: 
> 1. Install PyVCF and import vcf.  
> 2. Use vcf.Reader to read in the VCF file and get an iterable of Record instances. The module allows us to access the attributes of each Record object easily. 
> 3. Semicolon-delimited lists of key=value pairs are converted to Python dictionaries. Therefore, we are able to access the values within the INFO data field by key. 
> 4. Import vcf.filters, and use existing filters or create customized filters to return only the records that are considered "dangerous" as defined in Section 1. 
> 5. Convert the filtered records into a pandas dataframe for our data analysis. 
> 6. Examine our final dataframe. It is possible that some of the key data fields that we would like to present (as listed in Section 1) are missing. Below we summarize how we will modify the data in case of missing values: 
    - Missing ID: in this scenario, we may create a separate identification numbering to differentiate these records from those with ID values, e.g. their IDs may follow an "Unknown-001" convention. 
    - Missing CLNSIG: we will assign the value "Unknown" to these records.
    - Missing CLNDBN: we will assign the value "Unknown" to these records.
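
To sanity check that the filter defined in Section 1 behaves as intended, the steps above can be sketched against the sample records. This is only an illustration: plain string parsing stands in for PyVCF here, and the helper names `parse_info`, `is_dangerous`, and `dangerous_mutations` are hypothetical.

```python
# Sample records copied from the exam; tabs are written as \t.
SAMPLE = (
    "##fileformat=VCFv4.0\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t949523\trs786201005\tC\tT\t.\t.\tGENEINFO=ISG15;CLNSIG=5\n"
    "1\t949696\trs672601345\tC\tCG\t.\t.\tGENEINFO=ISG15;CLNSIG=5;CLNDBN=Cancer\n"
    "1\t949739\trs672601312\tG\tT\t.\t.\tGENEINFO=ISG15;CLNDBN=Cancer\n"
    "1\t955597\trs115173026\tG\tT\t.\t.\tGENEINFO=AGRN;CLNSIG=2; CLNDBN=Cancer\n"
    "1\t955619\trs201073369\tG\tC\t.\t.\tGENEINFO=AGG;CLNDBN=Heart_dis \n"
    "1\t957640\trs6657048\tC\tT\t.\t.\tGENEINFO=AGG;CLNSIG=3;CLNDBN=Heart_dis \n"
    "1\t976059\trs544749044\tC\tT\t.\t.\tGENEINFO=AGG;CLNSIG=0;CLNDBN=Heart_dis \n"
)

NOT_DANGEROUS = {"2", "3", "6", "7"}


def parse_info(info):
    """Turn 'KEY=value;KEY=value' into a dict, tolerating stray spaces."""
    pairs = {}
    for item in info.split(";"):
        key, sep, value = item.strip().partition("=")
        value = value.strip()
        if sep and value and value != ".":  # a dot means missing
            pairs[key] = value
    return pairs


def is_dangerous(info):
    """Apply the two criteria from Section 1 of the plan."""
    clnsig = info.get("CLNSIG")
    if clnsig in {"4", "5"}:
        return True
    return "CLNDBN" in info and clnsig not in NOT_DANGEROUS


def dangerous_mutations(text):
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.rstrip().split("\t")
        info = parse_info(fields[7])
        if is_dangerous(info):
            rows.append((info.get("GENEINFO", "Unknown"), fields[2], fields[0], fields[1]))
    return rows


result = dangerous_mutations(SAMPLE)
for gene, mutation_id, chrom, pos in result:
    print(gene, mutation_id, chrom, pos)
```

Each row pairs a gene symbol with a mutation ID, plus CHROM and POS, which is the combination the boss asked for.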


## 3. Data Visualization
> In our data visualization, we would list our dangerous mutations in a tabular format with the key data fields presented. Along with the table, we will plot two distribution graphs to show dangerous mutations by their CLNSIG, and by CLNDBN.
