# Statistics
## From Data to Decisions

## All Disciplines
- Social Sciences
- Medicine
- Engineering
- Public Policy
- Psychology
- Climatology
- Robotics
- Archaeology
- Health Sciences
- Finance
- Business & Marketing

even...
- Biology
- Physics

In [6]:
import pandas as pd, numpy as np

In [23]:
size_v_cost = pd.DataFrame(dict(house_size=(1400,
                                      2400,
                                      1800,
                                      1900,
                                      1300,
                                      1100), 
                                cost=(112000,
                                      192000,
                                      144000,
                                      152000,
                                      104000,
                                      88000)))


In [34]:

def valuing_houses_1(house_size=1300, svc=None):
    """
    :param house_size: Size (in sq ft) of the house
    :param svc: DataFrame with house_size and cost columns; defaults to size_v_cost
    :returns: How much money should you pay for that house
    """
    svc =  size_v_cost if svc is None else svc
    return svc.cost[svc.house_size==house_size].iat[0]

In [35]:
size_v_cost['house_size']

0    1400
1    2400
2    1800
3    1900
4    1300
5    1100
Name: house_size, dtype: int64

In [36]:
size_v_cost['house_size']==1300

0    False
1    False
2    False
3    False
4     True
5    False
Name: house_size, dtype: bool

In [37]:
valuing_houses_1()

104000

In [38]:
valuing_houses_1_answer = 'https://www.udacity.com/course/viewer#!/c-st101/l-48696651/e-48727691/m-48734093'
assert valuing_houses_1()==104000, 'A house of 1300 sq ft sold for 104,000 you should pay 104,000 {}'.format(valuing_houses_1_answer)

In [39]:
def valuing_houses_2():
    return valuing_houses_1(1800)

In [40]:
valuing_houses_2_answer = 'https://www.udacity.com/course/viewer#!/c-st101/l-48696651/e-48532777/m-48719197'
assert valuing_houses_2()==144000, 'A house of 1800 sq ft sold for 144,000 you should pay 144,000 {}'.format(valuing_houses_2_answer)

In [41]:
def valuing_houses_3():
    return valuing_houses_1(2100)

**This will raise an error (specifically `IndexError`) because no house in the data is exactly 2100 sq ft**

In [43]:
valuing_houses_3()

IndexError: index 0 is out of bounds for axis 0 with size 0
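The error comes from calling `.iat[0]` on an empty selection. As a minimal sketch (the helper name `lookup_cost` is hypothetical and not part of the original notebook), an exact-match lookup can check for an empty result first and return `None` instead of raising:

```python
import pandas as pd

# Same data as the size_v_cost table defined earlier.
size_v_cost = pd.DataFrame({
    'house_size': [1400, 2400, 1800, 1900, 1300, 1100],
    'cost': [112000, 192000, 144000, 152000, 104000, 88000],
})

def lookup_cost(house_size, svc=size_v_cost):
    """Hypothetical helper: exact-match lookup that returns None when no row matches."""
    matches = svc.cost[svc.house_size == house_size]
    # An empty selection is what makes .iat[0] raise IndexError.
    return None if matches.empty else matches.iat[0]

no_match = lookup_cost(2100)
exact_match = lookup_cost(1300)
```

This only avoids the exception; it still gives no estimate for sizes that are missing from the data.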

*Let's interpolate*


In [67]:
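def valuing_houses_3(sizes=(2100,), size=2100):
    """
    Linearly interpolates cost (dependent) over house size (independent)
    http://stackoverflow.com/a/27217695/1175496
    :param sizes: House sizes (in sq ft) to add as rows with unknown cost
    :param size: The size (in sq ft) whose interpolated cost is returned
    :returns: Expected cost of new house given its sq footage
    """
    # Append rows that have a size but a null (NaN) cost.
    # DataFrame.append was removed in pandas 2.0, so use pd.concat instead.
    new_rows = pd.DataFrame({'house_size': sizes})
    svc_appended = pd.concat([size_v_cost, new_rows], ignore_index=True)
    # Index the costs by house size and sort with sort_index (Series.order is deprecated),
    # so that method='values' interpolates over the actual sizes, not row positions.
    cost_by_size = pd.Series(index=svc_appended.house_size, data=svc_appended.cost.values).sort_index()
    cost_interpolated = cost_by_size.interpolate(method='values')[size]

    return cost_interpolated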

In [68]:
valuing_houses_3()

168000.0

In [69]:
valuing_houses_3_answer = 'https://www.udacity.com/course/viewer#!/c-st101/l-48696651/e-48532778/m-48204890'
assert valuing_houses_3() == 168000, 'Expected value of 2,100 sq ft home is 168,000 {}'.format(valuing_houses_3_answer)
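As a cross-check on the interpolated value, here is a minimal sketch that uses `np.interp` instead of the pandas approach above. It is an alternative technique, not the notebook's own method, and the variable names are chosen for illustration only.

```python
import numpy as np

# Sizes (sq ft) and sale prices copied from the size_v_cost table.
house_sizes = np.array([1400, 2400, 1800, 1900, 1300, 1100])
costs = np.array([112000, 192000, 144000, 152000, 104000, 88000])

# np.interp expects increasing x-coordinates, so sort both arrays by size first.
order = np.argsort(house_sizes)
estimate_2100 = np.interp(2100, house_sizes[order], costs[order])
print(estimate_2100)
```

The 2100 sq ft house falls between the 1900 and 2400 sq ft houses, so only those two data points determine the estimate.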

In [72]:
def valuing_houses_4():
    return valuing_houses_3(sizes=(1500,), size=1500)

In [81]:
s=pd.Series(data=(1, None, 4, 8, 9, None, 11), index=(10, 30, 40, 50, 60, 62, 80))


In [86]:
# Note that interpolate() defaults to method='linear':
# > ‘linear’: ignore the index and treat the values as equally spaced.
# So the first None becomes 2.5, the midpoint of its neighbours 1 and 4
# (1.5 away from each), regardless of the index values.
s.interpolate()

10     1.0
30     2.5
40     4.0
50     8.0
60     9.0
62    10.0
80    11.0
dtype: float64

In [87]:
# Whereas interpolate(method='values'):
# > ‘index’, ‘values’: use the actual numerical values of the index
# So the first None becomes 3.0. Its index value, 30, lies 2/3 of the way
# between the adjoining index values 10 and 40, so the interpolated value
# lies 2/3 of the way between 1 and 4, which is 3.
s.interpolate(method='values')

10     1.0
30     3.0
40     4.0
50     8.0
60     9.0
62     9.2
80    11.0
dtype: float64
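The two filled values in the `method='values'` output can be reproduced by hand. The sketch below applies the standard two-point linear interpolation formula; the function name `interpolate_by_index` is a hypothetical helper, not a pandas API.

```python
def interpolate_by_index(x, x0, y0, x1, y1):
    """Linear interpolation at x between the points (x0, y0) and (x1, y1)."""
    # Fraction of the way from x0 to x1 at which x sits.
    fraction = (x - x0) / (x1 - x0)
    return y0 + fraction * (y1 - y0)

# Index 30 sits between index 10 (value 1) and index 40 (value 4).
value_at_30 = interpolate_by_index(30, 10, 1, 40, 4)
# Index 62 sits between index 60 (value 9) and index 80 (value 11).
value_at_62 = interpolate_by_index(62, 60, 9, 80, 11)
print(value_at_30, value_at_62)
```

With `method='linear'` the index is ignored, which is why the same gaps were filled with 2.5 and 10.0 instead.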

In [71]:
valuing_houses_4_answer = 'https://www.udacity.com/course/viewer#!/c-st101/l-48696651/e-48696650/m-48299936'
assert valuing_houses_4() == 120000, 'Expected value of  1,500 sq ft home is 120,000 {}'.format(valuing_houses_4_answer)

# Proportionality is Constant
It turns out that the cost per sq ft is the same for every house in the data.

In [32]:
def valuing_houses_5():
    # Divide by the house_size column; DataFrame.size is the total element count, not the sizes
    cost_per_sqft = size_v_cost.cost/size_v_cost.house_size
    different_costs_per_sqft = cost_per_sqft.value_counts()
    assert len(different_costs_per_sqft)==1, 'There are multiple costs per sq ft, cannot return just one'
    return cost_per_sqft.iloc[0]

In [33]:
valuing_houses_5_answer = 'https://www.udacity.com/course/viewer#!/c-st101/l-48696651/e-48369931/m-48634702'
assert valuing_houses_5()==80, 'Cost per sq ft ($/sqft) should be 80 {}'.format(valuing_houses_5_answer)
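Because the rate is constant, any house size can be valued with a single multiplication, with no lookup or interpolation needed. The following is a minimal sketch under that assumption; `value_by_rate` is a hypothetical name and the rate is read from the first row of the data.

```python
import pandas as pd

# Same data as the size_v_cost table defined earlier.
size_v_cost = pd.DataFrame({
    'house_size': [1400, 2400, 1800, 1900, 1300, 1100],
    'cost': [112000, 192000, 144000, 152000, 104000, 88000],
})

# Every row has the same cost per sq ft, so the first row's ratio is enough.
rate = (size_v_cost.cost / size_v_cost.house_size).iloc[0]

def value_by_rate(house_size, rate=rate):
    """Hypothetical helper: value a house as size times a constant $/sqft rate."""
    return house_size * rate

print(value_by_rate(2100), value_by_rate(1500))
```

The results agree with the interpolated estimates for the 2100 and 1500 sq ft houses, as expected when all data points lie on one line through the origin.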