# Demand Media Data Science Intern test.
To begin, please clone this repository and run the notebook.
Check out [this resource](https://try.github.io/levels/1/challenges/1) for help with git, and [this beginner's guide](https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/) for a rundown on Jupyter.

## Data Tools

#### Question 1: Convert CSV to JSON
List some of the tools you would use to convert a CSV file to JSON. Please list all the different approaches you can think of for solving a data transformation problem like this.

<b>Answer:</b> I am most familiar with Python, so I would start there. There are several ways to do it within Python. You can read each row into a list (`row_list = line.split(',')` or equivalent, although a naive split breaks on quoted fields that contain commas, which the `csv` module handles), rearrange the fields as necessary, and use the `json` package to write the output file. You can also write the JSON "manually" in Python by placing the brackets, braces, and commas yourself.
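As a minimal sketch of the Python route, the standard-library `csv` and `json` modules handle this directly. `csv.DictReader` also copes with quoted fields that contain commas, which a plain `line.split(',')` would break. The function name `csv_to_json` and the sample data are illustrative assumptions, and the sketch assumes the first row is a header.

```python
import csv
import io
import json


def csv_to_json(csv_text):
    """Convert CSV text (first row = header) to a JSON array of objects."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Every value stays a string; convert types separately if needed.
    return json.dumps(list(reader), indent=2)


# Hypothetical sample input; note the quoted field containing a comma.
sample = 'id,title\n1,"Hello, world"\n2,Plain\n'
print(csv_to_json(sample))
```

For a file on disk, the same reader can wrap an open file handle (opened with `newline=''`) instead of `io.StringIO`.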

Other programming languages likely have equivalent methods.

For small files that need little editing, you can edit the file directly in Sublime Text: select multiple rows and press Ctrl+Shift+L to get one cursor per row.

A quick Google search also turns up web-based converters.

## SQL
Using the database schema below, answer the following questions.


|movie   |actor |casting|
|--------|------|-------|
|id      |id    |movieid|
|title   |name  |actorid|
|yr      |      |ord    |
|director|      |       | 
|budget  |      |       |
|gross   |      |       |






#### Question 2:
Find the actors in the movie "Gone with the Wind"

```sql
SELECT a.name FROM actor a
JOIN casting c
ON c.actorid = a.id
JOIN movie m
ON m.id = c.movieid
WHERE m.title = 'Gone with the Wind';
```

#### Question 3:
List the films together with the leading star for all 1962 films.

```sql
SELECT m.title, a.name FROM movie m
JOIN casting c
ON c.movieid = m.id
JOIN actor a
ON a.id = c.actorid
WHERE c.ord = 1
AND m.yr = 1962;
```
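To check queries like the two above without a real database, one option is an in-memory SQLite database built with Python's `sqlite3` module. This is only a sketch: the column types, the ids, and the handful of rows are assumptions for illustration, and the `ORDER BY c.ord` in the first query is added here only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movie (id INTEGER PRIMARY KEY, title TEXT, yr INTEGER,
                        director INTEGER, budget INTEGER, gross INTEGER);
    CREATE TABLE actor (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE casting (movieid INTEGER, actorid INTEGER, ord INTEGER);
    -- Illustrative rows only; ids and billing order are made up.
    INSERT INTO movie (id, title, yr) VALUES
        (1, 'Gone with the Wind', 1939), (2, 'Lawrence of Arabia', 1962);
    INSERT INTO actor (id, name) VALUES
        (1, 'Clark Gable'), (2, 'Vivien Leigh'), (3, 'Peter O''Toole');
    INSERT INTO casting (movieid, actorid, ord) VALUES
        (1, 1, 1), (1, 2, 2), (2, 3, 1);
""")

# Question 2: actors in a given movie (title passed as a parameter).
cast_rows = conn.execute("""
    SELECT a.name FROM actor a
    JOIN casting c ON c.actorid = a.id
    JOIN movie m ON m.id = c.movieid
    WHERE m.title = ?
    ORDER BY c.ord
""", ("Gone with the Wind",)).fetchall()

# Question 3: title and leading star (ord = 1) for films of a given year.
leads_1962 = conn.execute("""
    SELECT m.title, a.name FROM movie m
    JOIN casting c ON c.movieid = m.id
    JOIN actor a ON a.id = c.actorid
    WHERE c.ord = 1 AND m.yr = ?
""", (1962,)).fetchall()

print(cast_rows)
print(leads_1962)
```

Passing the title and year as `?` parameters also sidesteps string-quoting differences between database engines.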

## Data Science Questions

#### Question 4:
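Explain the curse of dimensionality to a child.

<b>Answer:</b> Every extra thing we measure about an item (a feature) adds another direction to search in. With very many features, the number of possible combinations grows so quickly that our data points end up spread thinly across them, and it is impossible to know ahead of time which features are important.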
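One way to make this concrete is to count how many "boxes" a grid needs when every feature is split into the same number of bins. This is a rough sketch; the choice of 10 bins, the sample count, and the name `grid_cells` are assumptions for illustration.

```python
def grid_cells(num_features, bins_per_feature=10):
    """Number of cells in a grid with the given bins along each feature."""
    return bins_per_feature ** num_features


# With a fixed number of samples, the largest share of cells that could
# hold at least one sample shrinks quickly as features are added.
num_samples = 1000  # illustrative sample count
for d in (1, 2, 3, 6):
    cells = grid_cells(d)
    max_filled = min(1.0, num_samples / cells)
    print(d, cells, max_filled)
```

With six features the grid already has a million cells, so a thousand samples can touch at most one cell in a thousand.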

#### Question 5:
Why is mean square error a bad measure of model performance? What would you suggest instead?

<b>Answer:</b> It depends on context. Squared error treats overestimates and underestimates symmetrically, but your business may lose more from an estimate that is too high than from one that is too low (or vice versa). In that case, you need a customized, asymmetric loss function.
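A customized loss of this kind could look like the sketch below, which weights overestimates more heavily than underestimates. The function name and the 3:1 weighting are assumptions chosen for illustration; the right weights depend on the actual business costs.

```python
def asymmetric_squared_error(y_true, y_pred, over_weight=3.0, under_weight=1.0):
    """Mean squared error with separate weights for over- and under-estimates."""
    total = 0.0
    for actual, predicted in zip(y_true, y_pred):
        error = predicted - actual
        # Positive error means the prediction was too high.
        weight = over_weight if error > 0 else under_weight
        total += weight * error ** 2
    return total / len(y_true)


actual = [10.0, 10.0]
print(asymmetric_squared_error(actual, [12.0, 12.0]))  # overestimates by 2
print(asymmetric_squared_error(actual, [8.0, 8.0]))    # underestimates by 2
```

Plain mean squared error would score both sets of predictions identically; here the overestimates cost three times as much.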

If you want to classify data into discrete groups rather than estimate a continuous quantity, squared error won't work well. Instead, use something like precision, recall, F-score, or AUC.
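For reference, precision, recall, and F-score can be computed from true and predicted labels as in this sketch. The function name and the toy labels are illustrative assumptions; in practice a library such as scikit-learn would normally be used.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy labels: 3 positives, 5 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
print(precision_recall_f1(y_true, y_pred))
```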

#### Question 6:
Do you think 50 small decision trees are better than a large one? Why?

<b>Answer:</b> It depends on the goal. One large decision tree may be more intelligible (the model is easier to understand and explain) and easier to troubleshoot.

Many small decision trees, each trained on a random subset of the data (as in a random forest), are more resistant to overfitting because their individual errors tend to average out.
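A rough way to see why combining many models can help: if each of several classifiers is right with probability `p` and they err independently, a majority vote is right more often than any single one. Independence is a strong assumption that real trees trained on overlapping data only approximate, and the sketch below (with the hypothetical name `majority_vote_accuracy`) uses an odd number of voters to avoid ties.

```python
from math import comb


def majority_vote_accuracy(n_voters, p_correct):
    """Probability that a strict majority of independent voters is correct."""
    needed = n_voters // 2 + 1
    return sum(
        comb(n_voters, k) * p_correct ** k * (1 - p_correct) ** (n_voters - k)
        for k in range(needed, n_voters + 1)
    )


print(majority_vote_accuracy(1, 0.6))   # a single classifier
print(majority_vote_accuracy(51, 0.6))  # majority of 51 independent ones
```

This models accuracy under idealized voting rather than overfitting directly, but it illustrates the averaging effect that ensembles rely on.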

#### Question 7:
How would you explain an A/B test to an engineer with no statistics background?

<b>Answer:</b> We want to find out whether the new interface causes our customers to buy more. We show the old interface to one randomly chosen group of customers and the new interface to another, then compare purchase rates. Purchase rates always vary somewhat at random, so we need to know whether the difference we see in the test is due to chance or really due to the new interface. Statistics lets us quantify how confident we can be in the result.
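One common way to put a number on "chance or real" is a two-proportion z-test, sketched here with only the standard library. The function name and the visitor and purchase counts are hypothetical, and the normal approximation assumes reasonably large samples.

```python
from math import erf, sqrt


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in conversion rates (z-test)."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the assumption that both interfaces perform the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Standard normal CDF via the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return 2 * (1 - cdf)


# Hypothetical counts: purchases out of visitors for each interface.
print(two_proportion_p_value(100, 1000, 100, 1000))  # identical rates
print(two_proportion_p_value(100, 1000, 150, 1000))  # B converts more
```

A small p-value means a difference this large would rarely appear by chance alone if the two interfaces really performed the same.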

## Python

#### Question 8:
Find the mean, median and mode of the elements in the list.
```python
a = [10,15,3,4,67,43,12]

```

```python
a = [10, 15, 3, 4, 67, 43, 12]

def mean(arr):
    # Use the argument rather than the global list
    return float(sum(arr)) / len(arr)

def median(arr):
    n = len(arr)
    ordered = sorted(arr)
    if n % 2:
        return ordered[n // 2]
    else:
        # Even length: average the two middle values
        return mean(ordered[(n // 2) - 1:(n // 2) + 1])

# On a tie, return all values with the maximum count; other tiebreakers are possible.
# The result is sorted so the output order does not depend on set iteration order.
def mode(arr):
    counts = [(n, arr.count(n)) for n in set(arr)]
    maxcount = max(count for _, count in counts)
    return sorted(num for num, count in counts if count == maxcount)

print('mean=', mean(a))
print('median=', median(a))
print('mode=', mode(a))
```

```
mean= 22.0
median= 12
mode= [3, 4, 10, 12, 15, 43, 67]
```
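For comparison, the standard-library `statistics` module provides these directly (`multimode` is available from Python 3.8). This is a short sketch alongside the hand-written answer, not a replacement for it.

```python
import statistics

a = [10, 15, 3, 4, 67, 43, 12]

print('mean=', statistics.mean(a))
print('median=', statistics.median(a))
# multimode returns every value tied for the highest count,
# in the order each value first appears in the input.
print('mode=', statistics.multimode(a))
```

Because every value in `a` occurs exactly once, all seven values tie for the mode.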


#### Question 9:
What will be the output of the code below? Please explain your answer.
```python
def multipliers():
  return [lambda x : i * x for i in range(4)]
    
print([m(2) for m in multipliers()])
```
How would you modify the definition of `multipliers` to produce the presumably desired behavior?

<b>Answer:</b> The result is `[6, 6, 6, 6]`. Python closures bind late: each lambda stores a reference to the variable `i`, not the value `i` had when the lambda was created, and `i` is only looked up when the lambda is called. By then the list comprehension has finished and `i` holds its last value (`i = 3`), so every function computes `3 * 2`. One way to fix this is to define a helper function that takes `i` as an argument: each call to the helper creates a new scope in which `i` is fixed at the value passed in.

You can also solve this in other ways, for example by doing the multiplication inside the list comprehension rather than storing lambda functions, although that returns a list of numbers instead of a list of functions.

```python
def fun(i):
    # Each call creates a new scope, so the returned lambda keeps its own i
    return lambda x: x * i

def multipliers():
    return [lambda x: i * x for i in range(4)]

def multipliersfixed():
    return [fun(i) for i in range(4)]

print([m(2) for m in multipliers()])
print([m(2) for m in multipliersfixed()])

def alternate(a):
    return [a * i for i in range(4)]

print(alternate(2))
```

```
[6, 6, 6, 6]
[0, 2, 4, 6]
[0, 2, 4, 6]
```
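Two other common fixes, sketched here with hypothetical function names, keep the result as a list of functions: binding `i` as a default argument, and using `functools.partial`. Both capture the current value of `i` at the moment each function is created.

```python
from functools import partial
from operator import mul


def multipliers_default():
    # The default value i=i is evaluated when each lambda is created.
    return [lambda x, i=i: i * x for i in range(4)]


def multipliers_partial():
    # partial stores the current value of i as the first argument to mul.
    return [partial(mul, i) for i in range(4)]


print([m(2) for m in multipliers_default()])
print([m(2) for m in multipliers_partial()])
```

Both calls print `[0, 2, 4, 6]`. The default-argument form is compact, but a caller could override `i` by passing a second argument, which `partial` avoids.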
