# Demand Media Data Science Intern test.
To begin, please clone this repository and run the notebook.
Check out [this resource](https://try.github.io/levels/1/challenges/1) for help with git, and see this guide for a rundown on [Jupyter](https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/).

## Data Tools

#### Question 1: Convert CSV to JSON
List some of the tools you would use to convert a CSV file to JSON. Please list all the different approaches you can think of for solving a data transformation problem like this.

##### Answer:
The approach I know is to do the conversion in Python in a few steps. First, get the paths of the input CSV file and the output JSON file. Then read the CSV with `csv.DictReader` and "dump" each row into the JSON file, line by line.
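The steps above can be sketched roughly as follows. This is a minimal illustration, not the original script: the function name `csv_to_json_lines` and the in-memory sample data are assumptions made for the example, and writing one JSON object per line produces a "JSON Lines" style file rather than a single JSON array.

```python
import csv
import io
import json


def csv_to_json_lines(csv_file, json_file):
    """Write each CSV row to json_file as one JSON object per line."""
    reader = csv.DictReader(csv_file)  # the first row supplies the field names
    count = 0
    for row in reader:
        json_file.write(json.dumps(row) + "\n")
        count += 1
    return count


# Hypothetical in-memory data standing in for the real input and output files
source = io.StringIO("id,title\n1,Gone with the Wind\n2,Lawrence of Arabia\n")
target = io.StringIO()
rows_written = csv_to_json_lines(source, target)
print(target.getvalue())
```

With real files, the same function could be called with two handles from `open(...)`. Note that `csv.DictReader` reads every value as a string, so numeric columns would need an explicit conversion.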

## SQL
Using the database schema below, answer the following questions.


|movie   |actor |casting|
|--------|------|-------|
|id      |id    |movieid|
|title   |name  |actorid|
|yr      |      |ord    |
|director|      |       | 
|budget  |      |       |
|gross   |      |       |






#### Question 2:
Find the actors in the movie "Gone with the Wind"

##### Answer:
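(see movie.sql for table creation; the queries below use the column names `mid`, `aid` and `aname`, which differ from the names in the schema table above)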
```SQL
SELECT A.aname
FROM movie M, actor A, casting C
WHERE M.mid = C.mid AND A.aid = C.aid
AND M.title = 'Gone with the Wind';
```

#### Question 3:
List the films together with the leading star for all 1962 films.

##### Answer:
```SQL
SELECT M.title, A.aname
FROM movie M, actor A, casting C
WHERE M.mid = C.mid AND A.aid = C.aid AND C.ord = 1 AND M.yr = 1962;
```
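The two queries above can be tried out with Python's built-in `sqlite3` module. The sketch below is an assumption-laden illustration: movie.sql is not reproduced here, so the table definitions are a guess that reuses the column names from the answers (`mid`, `aid`, `aname`, `ord`), and the handful of rows is a tiny sample inserted only for the demonstration.

```python
import sqlite3

# In-memory database with an assumed, simplified version of the tables
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movie (mid INTEGER PRIMARY KEY, title TEXT, yr INTEGER);
CREATE TABLE actor (aid INTEGER PRIMARY KEY, aname TEXT);
CREATE TABLE casting (mid INTEGER, aid INTEGER, ord INTEGER);
INSERT INTO movie VALUES (1, 'Gone with the Wind', 1939);
INSERT INTO movie VALUES (2, 'Lawrence of Arabia', 1962);
INSERT INTO actor VALUES (1, 'Clark Gable');
INSERT INTO actor VALUES (2, 'Vivien Leigh');
INSERT INTO actor VALUES (3, 'Peter O''Toole');
INSERT INTO casting VALUES (1, 1, 1);
INSERT INTO casting VALUES (1, 2, 2);
INSERT INTO casting VALUES (2, 3, 1);
""")

# Question 2: actors in a given movie (ordered by billing for a stable result)
cast = [row[0] for row in conn.execute(
    "SELECT A.aname FROM movie M, actor A, casting C "
    "WHERE M.mid = C.mid AND A.aid = C.aid AND M.title = ? "
    "ORDER BY C.ord",
    ("Gone with the Wind",),
)]

# Question 3: each 1962 film with its leading star (ord = 1)
leads = conn.execute(
    "SELECT M.title, A.aname FROM movie M, actor A, casting C "
    "WHERE M.mid = C.mid AND A.aid = C.aid AND C.ord = 1 AND M.yr = 1962"
).fetchall()

print(cast)
print(leads)
```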

## Data Science Questions

#### Question 4:
Explain the curse of Dimensionality to a child?

##### Answer:
In everyday life, we usually deal only with things that have three or fewer dimensions. A point is 0-dimensional; a line is 1-dimensional, or 1D; a flat shape is 2-dimensional, or 2D; and a physical object is 3-dimensional, or 3D. In data analysis, however, the data sometimes live in a higher-dimensional space, and that leads to a lot of problems.

The first problem is that such data are very hard to visualize. Since we can only see things in up to 3 dimensions, illustrating higher-dimensional data requires a projection, which loses some information. Here is an analogy: if I tell you that the cross-section of an object is a circle, you cannot tell whether the object is a sphere, a cylinder, or some other shape. Similarly, when we plot our data, we can only see a part of it.

Another problem of high dimensionality is that each additional dimension multiplies the time it takes to compute anything about the entire dataset, so the time grows exponentially with the number of dimensions. For instance, let's say we start with one cookie, which is easy to count. If we instead have a line of 10 cookies, they are still relatively easy to count. However, if we have a tray, which is 2D, of 10<sup>2</sup> = 100 cookies, or even a box, which is 3D, of 10<sup>3</sup> = 1000 cookies, the number of cookies grows exponentially. If we count the cookies one by one, each dimension we go up takes us 10 times as long!
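The cookie arithmetic can be written down in a couple of lines. This is just a toy sketch of the counting argument; the helper name `cookies_in_grid` is made up for the illustration.

```python
def cookies_in_grid(side, dimensions):
    """Number of cookies in a grid with `side` cookies along each dimension."""
    return side ** dimensions


# A line, a tray, a box, and a (hard to picture) 4D box of cookies
for d in range(1, 5):
    print("%dD: %d cookies" % (d, cookies_in_grid(10, d)))
```

Each step up in dimension multiplies the count by the side length, which is the "10 times as long" in the cookie example.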

#### Question 5:
Why is mean square error a bad measure of model performance? What would you suggest instead?

##### Answer:
There are two problems with mean square error: the "mean" part and the "square" part. Because it is an average of squared errors, it can be heavily affected by a single outlier. Taking the mean also discards the distribution of the errors. The well-known Anscombe's quartet, for instance, demonstrates how the same model can fit four very different patterns of data and still give nearly the same mean square error. Squaring treats positive and negative errors the same, which is not necessarily what we want for models that should always overpredict rather than underpredict (or vice versa), as in risk assessment.
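Both effects can be checked with a few numbers. The sketch below is only an illustration: the helper `mean_squared_error` and all the values are invented for this example.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between predictions and actual values."""
    errors = [p - a for a, p in zip(actual, predicted)]
    return sum(e * e for e in errors) / float(len(errors))


actual = [10.0, 12.0, 11.0, 13.0]

# Every prediction is off by 1
mse_small_errors = mean_squared_error(actual, [11.0, 11.0, 12.0, 12.0])

# Three perfect predictions and one outlier that is off by 20
mse_one_outlier = mean_squared_error(actual, [10.0, 12.0, 11.0, 33.0])

# Overpredicting and underpredicting by the same amount give the same score
mse_over = mean_squared_error(actual, [a + 2 for a in actual])
mse_under = mean_squared_error(actual, [a - 2 for a in actual])

print(mse_small_errors, mse_one_outlier, mse_over, mse_under)
```

The single outlier makes the second score far larger than the first even though three of its four predictions are exact, and the last two scores are identical even though the errors point in opposite directions.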

#### Question 6:
Do you think 50 small decision trees are better than a large one? Why?

##### Answer:
It depends. If we have enough samples to train 50 small decision trees and combine them into an ensemble, the ensemble will usually be better than one large tree, because combining many small trees reduces the risk of overfitting.

#### Question 7:
How would you explain an A/B test to an engineer with no statistics background?

##### Answer:
An A/B test is a popular method for comparing two versions of a product (such as a webpage or an application). I will use a webpage as my example and call the two versions Version A and Version B. Each user who visits the website is shown either Version A or Version B at random. Afterwards, statistical analysis is used to determine which version is better for the specific goal we are trying to achieve. For instance, let's say we want to test whether the color combination of the text "Buy now!" and its background affects sales on an online shopping site. Two possible variations are black text on a white background (A) and white text on a black background (B). After enough customers have visited each version, we can compare the click-through rate of the "Buy now!" button, or whether the sales were completed, to see whether one version performs statistically significantly better than the other.

A more common application of the A/B test is comparing a new variation against the current version to decide whether the new version is "better", depending on your specific goals. An A/B test allows us to 1) move our website or application to a better version if the variation beats the current version, or 2) learn that a specific variation, and the associated direction of change, will not work for our product if the variation is worse than the current version.
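One common way to carry out the "statistical analysis" step for click-through rates is a two-proportion z-test. The sketch below assumes that test is appropriate (large samples, independent visitors); the function name and the visitor and click counts are hypothetical and chosen only for illustration.

```python
import math


def two_proportion_z_test(clicks_a, visitors_a, clicks_b, visitors_b):
    """Return (z, two-sided p-value) for the difference in click-through rates."""
    rate_a = clicks_a / float(visitors_a)
    rate_b = clicks_b / float(visitors_b)
    # Pooled rate under the assumption that both versions perform the same
    pooled = (clicks_a + clicks_b) / float(visitors_a + visitors_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1.0 / visitors_a + 1.0 / visitors_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: version A got 200 clicks and B got 260, from 5000 visitors each
z, p_value = two_proportion_z_test(200, 5000, 260, 5000)
print("z = %.2f, p = %.4f" % (z, p_value))
```

A small p-value (conventionally below 0.05) suggests that the difference between the two versions is unlikely to be due to chance alone.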

## Python

#### Question 8:
Find the mean, median and mode of the elements in the list.
```python
a = [10,15,3,4,67,43,12]
```

##### Answer (stat.py):
```python
a = [10, 15, 3, 4, 67, 43, 12]

def mean(a):
    # float() avoids integer division under Python 2
    return float(sum(a)) / len(a)
print("mean = %g" % mean(a))

def median(a):
    a = sorted(a)
    n = len(a)
    mid = n // 2
    if n % 2 == 1:
        return a[mid]
    # even length: average the two middle values
    return (a[mid - 1] + a[mid]) / 2.0
print("median = %g" % median(a))

def mode(a):
    # frequency of each distinct element
    counts = {x: a.count(x) for x in set(a)}
    freq = max(counts.values())
    if freq == 1:
        # if no element is repeated, there is no mode
        return []
    return sorted(x for x in counts if counts[x] == freq)
print("mode = %s" % mode(a))
```
**Output:**
```
mean = 22
median = 12
mode = []
```
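As a cross-check, the same values can be computed with Python 3's standard `statistics` module and `collections.Counter`. This is a sketch assuming Python 3 is available; the helper `modes` is a made-up name that follows the same convention as the answer above (an empty list when no element is repeated).

```python
import statistics
from collections import Counter

a = [10, 15, 3, 4, 67, 43, 12]


def modes(values):
    """Return all most-frequent values, or [] if nothing is repeated."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(v for v, c in counts.items() if c == top)


mean_value = statistics.mean(a)
median_value = statistics.median(a)
mode_values = modes(a)
print(mean_value, median_value, mode_values)
```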

#### Question 9:
What will be the output of the code below? Please explain your answer.
```python
def multipliers():
  return [lambda x : i * x for i in range(4)]
    
print [m(2) for m in multipliers()]
```
How would you modify the definition of `multipliers` to produce the presumably desired behavior?

##### Answer:
**The output will be:**
```
[6, 6, 6, 6]
```
**Explanation:**  
When `multipliers()` is called in the list comprehension, the `for i in range(4)` loop runs to completion and returns four lambda functions. None of them stores the value of `i` at the time it is created; each one looks `i` up only when it is called. By the time `m(2)` is called, the loop has already finished, so `i` holds its last value, i.e. i = 3. Substituting x = 2 into the lambda function, every element equals i\*x = 3\*2 = 6.
  
**Modification 1:**
```python
def multipliers():
  return [lambda x, i=i : i * x for i in range(4)]
```
**Modification 2:**
```python
multipliers = lambda x: [i*x for i in range(4)]
```
In this case, we will simply call
```python
print multipliers(2)
```
instead.  
  
**Desired output:**
```
[0, 2, 4, 6]
```
which are the first 4 multiples of 2, starting from 0.
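A further option, not part of the answer above, is to bind `i` when each function is created by using `functools.partial` from the standard library. The sketch below keeps the original calling convention of `multipliers()`.

```python
from functools import partial
from operator import mul


def multipliers():
    # partial(mul, i) freezes the current value of i as the first argument of mul
    return [partial(mul, i) for i in range(4)]


result = [m(2) for m in multipliers()]
print(result)
```

Like the default-argument version in Modification 1, this captures the value of `i` at creation time instead of looking it up when the function is called.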