![Hopper](https://raw.githubusercontent.com/interviewquery/takehomes/hopper_1/hopper_1/logo.png)
# Candidate Assignment

![](https://github.com/interviewquery/takehomes/blob/hopper_1/hopper_1/image1.jpg?raw=1)

Thanks for your interest in a data science position at Hopper! As the
next step in your interview process we would like you to complete the
4 exercises below. Please calibrate the depth of your answers such
that you spend about 1 hour total on this work.

We use this homework to gauge how you solve a range of technical
problems that require both light coding and quantitative thinking. You
can use the language of your choice to complete these questions
(though Hopper is most familiar with Python and R). Feel free to use
any resources you need to solve these questions, so long as you
complete and present your own work.

Submit your answers in a separate document (Jupyter notebooks or
RMarkdown are both great!) and make sure to give us any instructions
needed for running the code sections. We look forward to seeing your
work!

# Exercise 1 - Programming

![](https://github.com/interviewquery/takehomes/blob/hopper_1/hopper_1/image2.jpg?raw=1)

Given the table of airports and locations (in latitude and longitude)
below, write a function that takes an airport code as input
and returns the airports listed from nearest to furthest from the
input airport. Use only the basic libraries for the language of your
choice (using sorting functions/methods provided by the standard
library is definitely fine).

Airport data is in `ex1_table.csv`.

# Exercise 2 - Testing

![](https://github.com/interviewquery/takehomes/blob/hopper_1/hopper_1/image3.png?raw=1)

*Note:* please use any of your favorite packages/libraries for this
section of the homework.

One of Hopper's innovative products is "Price Freeze", which allows
users to freeze a price for a period of time before purchasing the
ticket.

Suppose we are running a test comparing the current best model for
pricing a Price Freeze (the Champion) and a new model we think might
be better (the Challenger). We run the test showing these two
different variants to our users but we realize there is an issue! The
model is being shown to different numbers of people on different
mobile devices (iOS and Android) and some of the users are also seeing
a discount being offered. This makes the results of the test a bit
hard to interpret. Given the data `ex2_table.csv`, where

-   `variant` describes which model was used
-   `device_type` tells us which mobile OS that app is running on
-   `discount` is whether or not the users in this group received a
    discount
-   `total_views` is how many users saw the option to freeze
-   `price_freezes` is how many users chose to price freeze

Answer the following questions about the experiment. Make sure to show
any code, math, or reasoning you have for choosing your answer.

1.  What is the probability that the Challenger is the superior model?
2.  Based on your answer to number 1, would you be comfortable deciding
    yes/no on whether or not to change models?
3.  If we decide to switch exclusively to the Challenger model for our
    iOS users, do we have a reasonable chance at getting 500 price
    freezes in the first 10,000 views? What about 600?

# Exercise 3 - Representation

![](https://github.com/interviewquery/takehomes/blob/hopper_1/hopper_1/image4.jpg?raw=1)

*Note:* Don't worry about writing code in this section, you can just
describe any transformations of the data you would perform. Your
description should be clear enough that a data scientist reading this
would know how to implement your solution if necessary.

Data is in `ex3_table.csv`.

We want to create a mathematical model of a user's "trip" which can be
described as a collection of searches. This requires us to represent
this non-numeric data such that we can draw quantitative conclusions.

1.  How would you transform this collection of searches into a numeric
    vector representing a trip?

    1.   Assume that we have hundreds of thousands of users and we want to
        represent all of their trips this way.
    
    2.   We ideally want this to be a general representation we could use in
        multiple different modeling projects, but we definitely care about
        finding similar trips.

2.  How, precisely, would you compare two trips to see how similar they
    are?

3.  What information do you feel might be missing from the data above that
    would be helpful in improving your representation?

# Exercise 4 - Experiments and Data Collection

![](https://github.com/interviewquery/takehomes/blob/hopper_1/hopper_1/image6.jpg?raw=1)

An essential job of Hopper data science is coming up with new models
for our products and testing them, both to see which models are better
and to learn more about our products so that we can improve our
models.

One of the core features of the Hopper app is that it advises users
whether to buy a ticket now or wait for the price to go down and book
later. But what if our buy recommendation is wrong and the price in
fact drops after the user books on Hopper? To lower the pain when this
happens, Hopper has introduced "Price Drop", which refunds users a
certain % of the fare difference if the price drops *after* they book.

If you were the data scientist in charge of this project, what
information would you want to track to decide whether this feature is
successful? What would you track to determine how to improve this
feature?


In [1]:
!git clone --branch hopper_1 https://github.com/interviewquery/takehomes.git
%cd takehomes/hopper_1
!ls

Cloning into 'takehomes'...
remote: Enumerating objects: 1963, done.
remote: Counting objects: 100% (1963/1963), done.
remote: Compressing objects: 100% (1220/1220), done.
remote: Total 1963 (delta 752), reused 1928 (delta 726), pack-reused 0 (from 0)
Receiving objects: 100% (1963/1963), 297.43 MiB | 12.54 MiB/s, done.
Resolving deltas: 100% (752/752), done.
/content/takehomes/hopper_1
ex1_table.csv  ex3_table.csv  image2.jpg  image4.jpg  logo.png	     takehomefile.ipynb
ex2_table.csv  image1.jpg     image3.png  image6.jpg  metadata.json


In [2]:
MAIN_PATH = '/content/takehomes/hopper_1'

# Imports

In [4]:
import pandas as pd

In [6]:
table_1 = pd.read_csv(MAIN_PATH + '/ex1_table.csv')

In [7]:
table_1

Unnamed: 0,Airport Code,Lat,Long
0,CDG,49.012798,2.55
1,CHC,-43.489399,172.531998
2,DYR,64.734901,177.740997
3,EWR,40.692501,-74.168701
4,HNL,21.318701,-157.921997
5,OME,64.512199,-165.445007
6,ONU,-20.65,-178.699997
7,PEK,40.080101,116.584999


# Exercise 1: Rank Airports from Nearest to Furthest

In [9]:
table_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Airport Code  8 non-null      object 
 1   Lat           8 non-null      float64
 2   Long          8 non-null      float64
dtypes: float64(2), object(1)
memory usage: 320.0+ bytes


In [11]:
table_1 = table_1.astype({'Airport Code': 'str'})

In [34]:
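def sort_airports_by_distance(table: pd.DataFrame, centre_airport_code: str) -> pd.DataFrame:
    """Return the airports ordered from nearest to furthest from the given airport."""
    centre_rows = table[table['Airport Code'] == centre_airport_code]
    if centre_rows.empty:
        raise ValueError(f'Unknown airport code: {centre_airport_code}')
    centre_lat = centre_rows['Lat'].iloc[0]
    centre_long = centre_rows['Long'].iloc[0]
    # Work on a copy so the caller's table is not modified.
    table = table.copy()
    # Squared Euclidean distance in degrees: a flat-map approximation that ignores
    # the Earth's curvature and the wrap-around at +/-180 degrees of longitude.
    table['Distance'] = ((table['Lat'] - centre_lat) ** 2) + ((table['Long'] - centre_long) ** 2)
    table['Rank'] = table['Distance'].rank()
    table = table.sort_values('Rank', ascending=True)
    return table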

In [35]:
sort_airports_by_distance(table = table_1, centre_airport_code = 'CDG')

Unnamed: 0,Airport Code,Lat,Long,Distance,Rank
0,CDG,49.012798,2.55,0.0,1.0
3,EWR,40.692501,-74.168701,5954.986448,2.0
7,PEK,40.080101,116.584999,13083.774108,3.0
4,HNL,21.318701,-157.921997,26518.224866,4.0
5,OME,64.512199,-165.445007,28462.553904,5.0
2,DYR,64.734901,177.740997,30939.070083,6.0
1,CHC,-43.489399,172.531998,37450.536051,7.0
6,ONU,-20.65,-178.699997,37704.466792,8.0


In [36]:
sort_airports_by_distance(table = table_1, centre_airport_code = 'EWR')

Unnamed: 0,Airport Code,Lat,Long,Distance,Rank
3,EWR,40.692501,-74.168701,0.0,1.0
0,CDG,49.012798,2.55,5954.986448,2.0
4,HNL,21.318701,-157.921997,7389.958711,3.0
5,OME,64.512199,-165.445007,8898.742094,4.0
6,ONU,-20.65,-178.699997,14689.694187,5.0
7,PEK,40.080101,116.584999,36387.349195,6.0
2,DYR,64.734901,177.740997,64036.533207,7.0
1,CHC,-43.489399,172.531998,67947.827106,8.0
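
The `Distance` column above is a squared difference of raw latitude and longitude values, so it treats degrees as flat coordinates. Airports on opposite sides of the 180° meridian (for example DYR and OME) then look far apart even though they are neighbours. A minimal sketch of an alternative, assuming great-circle (haversine) distance on a spherical Earth is an acceptable measure, using only the standard library. The names `AIRPORTS`, `EARTH_RADIUS_KM`, `haversine_km` and `airports_by_distance` are illustrative and not part of the original notebook.

```python
import math

# Coordinates copied from ex1_table.csv as displayed above: code -> (lat, long).
AIRPORTS = {
    "CDG": (49.012798, 2.55),
    "CHC": (-43.489399, 172.531998),
    "DYR": (64.734901, 177.740997),
    "EWR": (40.692501, -74.168701),
    "HNL": (21.318701, -157.921997),
    "OME": (64.512199, -165.445007),
    "ONU": (-20.65, -178.699997),
    "PEK": (40.080101, 116.584999),
}

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    # Clamp to 1.0 to guard against floating-point rounding before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def airports_by_distance(code, airports=AIRPORTS):
    """Return airport codes ordered from nearest to furthest from `code`."""
    if code not in airports:
        raise KeyError(f"Unknown airport code: {code}")
    lat0, lon0 = airports[code]
    return sorted(airports, key=lambda c: haversine_km(lat0, lon0, *airports[c]))


print(airports_by_distance("DYR"))
```

The input airport comes first because its distance to itself is zero. Because the haversine formula works on the sine of half the longitude difference, the wrap-around at the date line needs no special handling.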


# Exercise 2

In [37]:
table_2 = pd.read_csv(MAIN_PATH + '/ex2_table.csv')

In [39]:
table_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   device_type    8 non-null      object
 1   variant        8 non-null      object
 2   discount       8 non-null      bool  
 3   total_views    8 non-null      int64 
 4   price_freezes  8 non-null      int64 
dtypes: bool(1), int64(2), object(2)
memory usage: 392.0+ bytes


In [41]:
table_2

Unnamed: 0,device_type,variant,discount,total_views,price_freezes
0,android,Challenger,False,6010,189
1,android,Challenger,True,331,16
2,android,Champion,False,1084,23
3,android,Champion,True,54,3
4,iOS,Challenger,False,6905,336
5,iOS,Challenger,True,1986,196
6,iOS,Champion,False,6576,266
7,iOS,Champion,True,2054,161
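
Since `total_views` differs a lot between groups, raw `price_freezes` counts are not directly comparable. A small sketch of the per-group conversion rate (`price_freezes / total_views`) in plain Python, with the rows copied from the table above. The names `ROWS` and `conversion_rates` are illustrative only.

```python
# Rows copied from ex2_table.csv as displayed above:
# (device_type, variant, discount, total_views, price_freezes)
ROWS = [
    ("android", "Challenger", False, 6010, 189),
    ("android", "Challenger", True, 331, 16),
    ("android", "Champion", False, 1084, 23),
    ("android", "Champion", True, 54, 3),
    ("iOS", "Challenger", False, 6905, 336),
    ("iOS", "Challenger", True, 1986, 196),
    ("iOS", "Champion", False, 6576, 266),
    ("iOS", "Champion", True, 2054, 161),
]


def conversion_rates(rows):
    """Map (device_type, discount, variant) to price_freezes / total_views."""
    return {
        (device, discount, variant): freezes / views
        for device, variant, discount, views, freezes in rows
    }


rates = conversion_rates(ROWS)
# Sorting by (device, discount, variant) puts the two variants of each group side by side.
for (device, discount, variant), rate in sorted(rates.items()):
    print(f"{device:8s} discount={discount!s:5s} {variant:10s} {rate:.2%}")
```

Comparing the variants within each `device_type` and `discount` group keeps the unequal exposure from driving the comparison.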


1. What is the probability that the Challenger is the superior model?


A: The performance of a model can be measured by its conversion rate, `price_freezes / total_views`, compared separately for
- users who were offered a discount
- users who were not offered a discount
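
One way to put a number on question 1 is a Bayesian comparison of conversion rates. The sketch below is one possible approach, not the expected answer. It assumes a uniform Beta(1, 1) prior on each group's conversion rate, so each posterior is Beta(1 + freezes, 1 + views - freezes), and it estimates by Monte Carlo how often the Challenger's rate exceeds the Champion's. For the overall figure, it further assumes each `(device_type, discount)` group should be weighted by its share of all views, so that both variants are judged on the same mix of users. The names `STRATA`, `draw_rate` and `prob_challenger_better` are illustrative.

```python
import random

# (device_type, discount) -> {variant: (total_views, price_freezes)}, from ex2_table.csv
STRATA = {
    ("android", False): {"Challenger": (6010, 189), "Champion": (1084, 23)},
    ("android", True): {"Challenger": (331, 16), "Champion": (54, 3)},
    ("iOS", False): {"Challenger": (6905, 336), "Champion": (6576, 266)},
    ("iOS", True): {"Challenger": (1986, 196), "Champion": (2054, 161)},
}


def draw_rate(rng, views, freezes):
    """One draw from the Beta posterior of a conversion rate (uniform prior assumed)."""
    return rng.betavariate(1 + freezes, 1 + views - freezes)


def prob_challenger_better(strata=STRATA, n_draws=20000, seed=0):
    """Return (per-stratum probabilities, overall probability) that Challenger > Champion."""
    rng = random.Random(seed)
    total_views = sum(v for arms in strata.values() for v, _ in arms.values())
    # Common weights: each stratum's share of all views across both variants.
    weights = {s: sum(v for v, _ in arms.values()) / total_views for s, arms in strata.items()}
    wins = {s: 0 for s in strata}
    overall_wins = 0
    for _ in range(n_draws):
        overall_diff = 0.0
        for s, arms in strata.items():
            diff = draw_rate(rng, *arms["Challenger"]) - draw_rate(rng, *arms["Champion"])
            wins[s] += diff > 0
            overall_diff += weights[s] * diff
        overall_wins += overall_diff > 0
    per_stratum = {s: w / n_draws for s, w in wins.items()}
    return per_stratum, overall_wins / n_draws


per_stratum, overall = prob_challenger_better()
for stratum, p in per_stratum.items():
    print(stratum, round(p, 3))
print("overall:", round(overall, 3))
```

The per-stratum probabilities show where the evidence is thin. The android discount group has only 54 Champion views, so its estimate is much less certain than the others.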

2. Based on your answer to number 1, would you be comfortable deciding yes/no on whether or not to change models?
3. If we decide to switch exclusively to the Challenger model for our iOS users, do we have a reasonable chance at getting 500 price freezes in the first 10,000 views? What about 600?
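
A possible way to approach question 3 is the posterior predictive distribution of the number of price freezes in 10,000 future views. The sketch below rests on two assumptions: iOS users keep seeing discounts in the same proportion as in the test, so the two iOS Challenger rows can be pooled (532 freezes in 8,891 views, about 6%), and the prior on the conversion rate is uniform. Under those assumptions the future count follows a Beta-Binomial distribution, whose upper tail is summed exactly with `math.lgamma`. The names `log_beta` and `prob_at_least` are illustrative.

```python
import math

# Pooled iOS Challenger rows from ex2_table.csv (with and without discount).
IOS_CHALLENGER_VIEWS = 6905 + 1986
IOS_CHALLENGER_FREEZES = 336 + 196


def log_beta(x, y):
    """Natural log of the Beta function B(x, y)."""
    return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)


def prob_at_least(target, n_future, views, freezes):
    """P(at least `target` freezes in `n_future` views) under a Beta-Binomial model."""
    a = 1 + freezes          # posterior Beta parameters with a uniform prior
    b = 1 + views - freezes
    log_norm = log_beta(a, b)
    total = 0.0
    for k in range(target, n_future + 1):
        log_choose = (
            math.lgamma(n_future + 1) - math.lgamma(k + 1) - math.lgamma(n_future - k + 1)
        )
        total += math.exp(log_choose + log_beta(k + a, n_future - k + b) - log_norm)
    return min(total, 1.0)


for target in (500, 600):
    p = prob_at_least(target, 10_000, IOS_CHALLENGER_VIEWS, IOS_CHALLENGER_FREEZES)
    print(f"P(at least {target} price freezes) = {p:.3f}")
```

With a pooled rate of about 6%, the expected count in 10,000 views is close to 600, so 500 sits well below the expectation while 600 is roughly at it. If the discount mix changes after the switch, the pooled rate no longer applies and the two rows should be modelled separately.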