### In this script, we look at homes close to one specific university and check whether saleprice correlates strongly with any home feature. We confine each analysis to a single university's neighborhood to hold location roughly constant.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import time
%matplotlib inline
pd.set_option('display.max_columns', 100)

In [2]:
allPair = pd.read_csv("../data/allPair.csv",low_memory=False)

In [5]:
# We focus on CA because it has the most homes and universities.
CA = allPair[allPair["state"] == "CA"]
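
The claim in the comment above can be checked with a quick count per state. This is a minimal sketch, assuming only the `state` and `UniversityName` columns used in this notebook; the small `demo` frame is made-up data standing in for `allPair`.

```python
import pandas as pd

# Hypothetical stand-in for allPair; the real frame comes from ../data/allPair.csv.
demo = pd.DataFrame({
    "state": ["CA", "CA", "CA", "NY", "TX", "TX"],
    "UniversityName": ["U1", "U1", "U2", "U3", "U4", "U4"],
})

# Number of rows (home/university pairs) per state, largest first.
homes_per_state = demo["state"].value_counts()

# Number of distinct universities per state, largest first.
univ_per_state = (demo.groupby("state")["UniversityName"]
                      .nunique()
                      .sort_values(ascending=False))

print(homes_per_state.head())
print(univ_per_state.head())
```

Running the same two expressions on `allPair` instead of `demo` would show whether CA leads on both counts.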

In [8]:
CA.info()  # info() prints the summary itself and returns None

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16300 entries, 0 to 16299
Data columns (total 56 columns):
website_x              16300 non-null object
home_url               16298 non-null object
property_type          16300 non-null object
record_type            16300 non-null object
parser_type            16300 non-null object
latitude_x             16300 non-null float64
longitude_x            16300 non-null float64
streetaddr_x           16300 non-null object
city_x                 16300 non-null object
state                  16300 non-null object
zipcode                16300 non-null object
country                16300 non-null object
numbed                 16300 non-null int64
num_bath_full          16300 non-null int64
num_bath_part          16300 non-null int64
rentalprice_min        16300 non-null int64
rentalprice_max        16300 non-null int64
saleprice              16300 non-null int64
yearbuilt              16300 non-null object
floor_plan             0 non-null float64

### We analyze the 5 universities with the most homes paired to them.

In [19]:
# Count the homes paired with each university, largest first.
count = CA.groupby('UniversityName').size().sort_values(ascending=False)
print(count[:5])

UniversityName
University of Redlands                            1269
William Jessup University                          983
California State University   - San Bernardino     783
California State University   - Bakersfield        527
The Sage Colleges                                  422
dtype: int64


In [36]:
# (name, count) pairs for the five largest groups, largest first.
top5Universities = sorted(count[:5].to_dict().items(), key=lambda x: -x[1])
print(top5Universities)

[('University of Redlands', 1269), ('William Jessup University', 983), ('California State University   - San Bernardino', 783), ('California State University   - Bakersfield', 527), ('The Sage Colleges', 422)]
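
The counting and sorting above can also be done in one step, since `value_counts()` already returns counts in descending order. This is a sketch on made-up data: the `demo` frame and its university names are hypothetical, and `head(2)` stands in for the top five.

```python
import pandas as pd

# Hypothetical data standing in for the CA frame.
demo = pd.DataFrame({"UniversityName": ["A", "B", "A", "C", "A", "B"]})

# value_counts() sorts by count, descending, so head(n) gives the top n directly.
top = demo["UniversityName"].value_counts().head(2)

# Convert to plain (name, count) tuples, matching the list printed above.
top_pairs = [(name, int(n)) for name, n in top.items()]
print(top_pairs)
```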


In [41]:
# Build the groupby once, then pull out one DataFrame per top university.
grouped = CA.groupby('UniversityName')
topUniv = [grouped.get_group(name) for name, _ in top5Universities]

print(len(topUniv))
print(type(topUniv[0]))
print(topUniv[0].shape)

5
<class 'pandas.core.frame.DataFrame'>
(1269, 56)


### For each university, find the correlation between home saleprice and each home feature.

In [59]:
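distanceCorr = []
for df in topUniv:
    print("********************")
    # Drop rows whose yearbuilt is the placeholder "\N", then cast to int.
    # Assigning to a filtered slice is what triggers the warning shown below.
    df = df[df['yearbuilt'] != "\\N"]
    df["yearbuilt"] = df["yearbuilt"].astype(int)
    # Pairwise correlations; keep only the saleprice column.
    price_corr = df.corr()["saleprice"].to_dict()
    print(df.iloc[0]['UniversityName'])
    print("number of nearest homes: %d" % df.shape[0])
    # Rank by absolute correlation. NaN entries do not compare consistently,
    # so the printed order is not a strict ranking.
    for ele in sorted(price_corr.items(), key=lambda x: -abs(x[1])):
        print("{0}: \t{1}".format(*ele))
        if ele[0] == 'distance':
            distanceCorr.append(ele[1])
print("********************")
# Correlation between saleprice and distance per university, then their mean.
print(distanceCorr)
print(sum(distanceCorr) / len(distanceCorr))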

********************
University of Redlands
number of nearest homes: 811
size: 	0.5109935229
num_bath_part: 	0.0718129978106
rentalprice_max: 	nan
floor_plan: 	nan
saleprice: 	1.0
numbed: 	0.324790874931
pool: 	0.292271420793
fireplace: 	0.246796874385
gatedCommunity: 	0.231519819251
latitude_x: 	-0.177033129654
yearbuilt: 	0.135544209816
stainlessAppliances: 	0.102827046209
distance: 	0.0459389817743
num_bath_full: 	-0.00134995868744
rentalprice_min: 	nan
longitude_x: 	0.0425178583378
renovation: 	0.0409305000956
latitude_y: 	1.35932380658e-15
longitude_y: 	1.21966244575e-15
********************
William Jessup University
number of nearest homes: 669
size: 	0.486744243119
num_bath_part: 	0.131904881543
rentalprice_max: 	nan
floor_plan: 	nan
num_bath_full: 	0.43385600787
distance: 	0.22677896713
rentalprice_min: 	nan
saleprice: 	1.0
numbed: 	0.308435803851
longitude_x: 	0.271273072959
pool: 	0.169028203481
yearbuilt: 	0.137990793794
fireplace: 	0.115496909617
latitude_x: 	0.10249318113


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
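
The warning above comes from assigning to `df["yearbuilt"]` on a filtered slice; taking an explicit `.copy()` after filtering avoids it. The printed ranking is also out of order because `nan` correlations do not compare meaningfully inside `sorted`, and dropping them first restores a proper ranking. Below is a minimal sketch of both changes. The helper name `price_correlations` and the `demo` frame are hypothetical, not part of the original analysis.

```python
import pandas as pd

def price_correlations(df):
    """Return correlations with saleprice, strongest first, with NaN dropped."""
    # .copy() makes the filtered frame independent, so the assignment below
    # does not operate on a slice of the original DataFrame.
    clean = df[df["yearbuilt"] != "\\N"].copy()
    clean["yearbuilt"] = clean["yearbuilt"].astype(int)
    # Restrict to numeric columns before correlating.
    corr = clean.select_dtypes(include="number").corr()["saleprice"]
    # Drop NaN results (e.g. constant columns) and the trivial self-correlation.
    corr = corr.dropna().drop("saleprice")
    # Order by absolute strength, keeping the signed values.
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Made-up data for illustration only.
demo = pd.DataFrame({
    "saleprice": [100, 200, 300, 400, 500],
    "size": [10, 22, 29, 41, 50],
    "distance": [5, 1, 4, 2, 3],
    "rentalprice_min": [0, 0, 0, 0, 0],   # constant column, correlation is NaN
    "yearbuilt": ["1990", "\\N", "2001", "1985", "2010"],
})
ranked = price_correlations(demo)
print(ranked)
```

Applied to each frame in `topUniv`, a helper like this would print the features in true order of absolute correlation and leave out the `nan` rows.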


### As expected, home size is consistently the feature most strongly correlated with saleprice (about 0.5 in the output shown above). Correlations with the other features are weaker, including distance to the university (0.05 and 0.23 for the two universities shown). Although this is an interesting experiment, it does not help much in analyzing the universities' impact.
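
How much weight a correlation deserves depends on the sample size as well as its magnitude. One rough way to gauge this, which the original analysis does not use, is a percentile bootstrap confidence interval around Pearson's r. The sketch below uses synthetic data only; `bootstrap_corr_ci` is a hypothetical helper and the numbers it prints say nothing about the real homes.

```python
import numpy as np

def bootstrap_corr_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Pearson r with a percentile bootstrap confidence interval."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.RandomState(seed)
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]
    boots = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.randint(0, n, size=n)  # resample rows with replacement
        boots[b] = np.corrcoef(x[idx], y[idx])[0, 1]
    lo, hi = np.percentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return r, lo, hi

# Synthetic data: price depends on size but not on distance.
rng = np.random.RandomState(1)
size = rng.normal(2000, 500, 300)
distance = rng.uniform(0, 10, 300)
price = 150 * size + rng.normal(0, 60000, 300)

r_size, lo_size, hi_size = bootstrap_corr_ci(size, price)
r_dist, lo_dist, hi_dist = bootstrap_corr_ci(distance, price)
print("size:     r=%.2f, 95%% CI [%.2f, %.2f]" % (r_size, lo_size, hi_size))
print("distance: r=%.2f, 95%% CI [%.2f, %.2f]" % (r_dist, lo_dist, hi_dist))
```

An interval that stays well away from zero suggests a real linear association; one that straddles zero suggests the feature adds little on its own.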