In [None]:
# Pandas Data Cleaning and Exploratory Data Analysis (EDA)

In [9]:
import pandas as pd

## Upload Data

In [10]:
housing_header = ["HomeID", "HomeAge", "HomeSqft", "LotSize", "BedRooms", 
                  "HighSchoolAPI", "ProxFwy", "CarGarage", "ZipCode", "HomePriceK"]
df = pd.read_csv("fixed-housing-data.csv", names=housing_header)
# What does the names= argument do here?
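
One way to explore what `names=` does is to read a tiny headerless sample twice, once without and once with the argument. This is only a sketch: the in-memory string `raw` is a stand-in for a file (an assumption for illustration), built from the first three fields of the first two rows shown in this notebook.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for a headerless CSV file.
raw = "1,24,1757\n2,10,1563\n"

# Without names=, pandas treats the first data row as the header.
no_names = pd.read_csv(io.StringIO(raw))

# With names=, every row is kept as data and the columns get our labels.
with_names = pd.read_csv(io.StringIO(raw), names=["HomeID", "HomeAge", "HomeSqft"])

print(len(no_names), list(no_names.columns))
print(len(with_names), list(with_names.columns))
```

Comparing the two row counts shows why supplying names matters when the file itself carries no header line.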

In [11]:
df.head()

,HomeID,HomeAge,HomeSqft,LotSize,BedRooms,HighSchoolAPI,ProxFwy,CarGarage,ZipCode,HomePriceK
0,1,24,1757,6056,2,899,3,3,94085,894
1,2,10,1563,6085,2,959,4,3,94085,861
2,3,14,1344,6089,2,865,4,3,94085,831
3,4,14,1215,6129,3,959,4,2,94085,809
4,5,24,1866,6141,3,877,4,1,94085,890


Why did we only want to display the first 5 rows of the dataframe?

What if we wanted to see the size of this dataframe?

In [12]:
# number of rows
len(df)


100

In [13]:
# shape of df (rows, columns)
df.shape

(100, 10)

## Change Column Name(s)

Why would we want to change the column names?

In [14]:
df = df.rename(columns={'HighSchoolAPI': 'SchoolAPI'})
df.head()

,HomeID,HomeAge,HomeSqft,LotSize,BedRooms,SchoolAPI,ProxFwy,CarGarage,ZipCode,HomePriceK
0,1,24,1757,6056,2,899,3,3,94085,894
1,2,10,1563,6085,2,959,4,3,94085,861
2,3,14,1344,6089,2,865,4,3,94085,831
3,4,14,1215,6129,3,959,4,2,94085,809
4,5,24,1866,6141,3,877,4,1,94085,890
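
A small sketch of how `rename` behaves, assuming a throwaway one-row frame named `demo` (hypothetical, not the notebook's `df`): it returns a new DataFrame and leaves columns that are not in the mapping untouched.

```python
import pandas as pd

# Hypothetical one-row frame using two of the notebook's column names.
demo = pd.DataFrame({"HighSchoolAPI": [899], "HomePriceK": [894]})

# rename() returns a new DataFrame; columns missing from the mapping are left alone.
renamed = demo.rename(columns={"HighSchoolAPI": "SchoolAPI"})

print(list(demo.columns))     # the original frame is unchanged
print(list(renamed.columns))
```

This is why the notebook assigns the result back to `df`.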


## Create New Columns

What new information can we derive from the existing columns and save?

Let's create a new column, "Price2019", that scales each "HomePriceK" value by 1.04 (a 4% increase).

In [15]:
prices_2019 = [(price * 1.04) for price in df["HomePriceK"]]
df["Price2019"] = prices_2019
#Check if it worked
df.head()

,HomeID,HomeAge,HomeSqft,LotSize,BedRooms,SchoolAPI,ProxFwy,CarGarage,ZipCode,HomePriceK,Price2019
0,1,24,1757,6056,2,899,3,3,94085,894,929.76
1,2,10,1563,6085,2,959,4,3,94085,861,895.44
2,3,14,1344,6089,2,865,4,3,94085,831,864.24
3,4,14,1215,6129,3,959,4,2,94085,809,841.36
4,5,24,1866,6141,3,877,4,1,94085,890,925.6
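
The list comprehension above works, but pandas can also apply arithmetic to a whole column at once. A minimal sketch, assuming a small frame `demo` (a hypothetical name) that holds the five `HomePriceK` values shown in the table:

```python
import pandas as pd

# First five HomePriceK values from the table above.
demo = pd.DataFrame({"HomePriceK": [894, 861, 831, 809, 890]})

# Multiplying a column by a scalar applies the operation to every row.
demo["Price2019"] = demo["HomePriceK"] * 1.04

# The loop-based version, kept only for comparison.
looped = [price * 1.04 for price in demo["HomePriceK"]]

print(demo["Price2019"].round(2).tolist())
```

Both approaches produce the same numbers; the vectorized form is shorter and usually faster on large frames.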


## Drop Columns

In [16]:
df = df.drop("ProxFwy", axis = 1)
#Check if it dropped
df.head()

,HomeID,HomeAge,HomeSqft,LotSize,BedRooms,SchoolAPI,CarGarage,ZipCode,HomePriceK,Price2019
0,1,24,1757,6056,2,899,3,94085,894,929.76
1,2,10,1563,6085,2,959,3,94085,861,895.44
2,3,14,1344,6089,2,865,3,94085,831,864.24
3,4,14,1215,6129,3,959,2,94085,809,841.36
4,5,24,1866,6141,3,877,1,94085,890,925.6
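
As a sketch of an equivalent spelling, `drop` also accepts a `columns=` keyword, which avoids having to remember which axis number means columns. The frame `demo` below is hypothetical.

```python
import pandas as pd

# Hypothetical one-row frame with three of the notebook's columns.
demo = pd.DataFrame({"HomeID": [1], "ProxFwy": [3], "CarGarage": [3]})

# columns= names the axis explicitly, so no axis argument is needed.
by_keyword = demo.drop(columns=["ProxFwy"])
by_axis = demo.drop("ProxFwy", axis=1)

print(list(by_keyword.columns))
print(by_keyword.equals(by_axis))
```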


In [17]:
df.ZipCode.unique()

array([94085, 95051, 94087, 95014])

In [18]:
df["CarGarage"].unique()

array([3, 2, 1, 0])

# EXPLORATORY DATA ANALYSIS

<h3>"Exploratory data analysis or 'EDA' is a <b>critical</b> beginning step in analyzing the data from an experiment.</h3>

<b>Here are the main reasons we use EDA:</b>
<ul>
• detection of mistakes<br><br>
• checking of assumptions<br><br>
• preliminary selection of appropriate models<br><br>
• determining relationships among the explanatory variables, and<br><br>
• assessing the direction and rough size of relationships between explanatory and outcome variables."</ul>


## Now what?

We have cleaned our data to the best of our ability based on the initial look. Now let's try to look at the <b>relationships</b> between different values. 

In [19]:
df.head()

,HomeID,HomeAge,HomeSqft,LotSize,BedRooms,SchoolAPI,CarGarage,ZipCode,HomePriceK,Price2019
0,1,24,1757,6056,2,899,3,94085,894,929.76
1,2,10,1563,6085,2,959,3,94085,861,895.44
2,3,14,1344,6089,2,865,3,94085,831,864.24
3,4,14,1215,6129,3,959,2,94085,809,841.36
4,5,24,1866,6141,3,877,1,94085,890,925.6


Let's look at how the homes are spread across zip codes. We know that the .unique() function returns all the unique values in a column, but what if we also wanted to <b>count</b> how many times each unique value appears?

In [20]:
df.ZipCode.value_counts()

95051    25
95014    25
94087    25
94085    25
Name: ZipCode, dtype: int64
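
If proportions are more useful than raw counts, `value_counts` takes a `normalize=True` argument. A minimal sketch, assuming a Series `zips` (hypothetical name) rebuilt from the counts printed above:

```python
import pandas as pd

# Rebuilt from the value_counts() output above: 25 homes per zip code.
zips = pd.Series([94085] * 25 + [95051] * 25 + [94087] * 25 + [95014] * 25, name="ZipCode")

counts = zips.value_counts()
shares = zips.value_counts(normalize=True)  # fractions that sum to 1

print(shares.sort_index())
```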

In [25]:
# sort_index() orders the counts by garage size instead of by frequency
df.CarGarage.value_counts().sort_index()

0    31
1    18
2    19
3    32
Name: CarGarage, dtype: int64

Across the whole dataset, 0-car and 3-car garages are the most common. Does that pattern hold within each zip code? Let's look into this a little more.


## GroupBy 

In [28]:
# Count garage sizes within each zip code; sort_index() orders the result by ZipCode, then CarGarage
df1 = df.groupby("ZipCode").CarGarage.value_counts().sort_index()
# Try appending .to_frame() to see the same counts as a DataFrame
print(df1)

ZipCode  CarGarage
94085    0             4
         1             7
         2             6
         3             8
94087    0             9
         1             7
         2             4
         3             5
95014    0             9
         1             2
         2             3
         3            11
95051    0             9
         1             2
         2             6
         3             8
Name: CarGarage, dtype: int64
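
The stacked output above can be hard to compare across zip codes. One option, sketched here under the assumption that a wide table is easier to read, is `unstack()`. The Series `counts` is a hypothetical reconstruction that copies the numbers for two of the zip codes from the output above.

```python
import pandas as pd

# Counts copied from the grouped output above for two of the zip codes.
index = pd.MultiIndex.from_product(
    [[94085, 94087], [0, 1, 2, 3]], names=["ZipCode", "CarGarage"]
)
counts = pd.Series([4, 7, 6, 8, 9, 7, 4, 5], index=index, name="count")

# unstack() pivots the inner index level (CarGarage) into columns.
table = counts.unstack()
print(table)
```

Each row is now a zip code and each column a garage size, so the counts line up side by side.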


## More about GROUP BY
"This grouped variable is now a GroupBy object. It has not actually computed anything yet except for some intermediate data about the group key df['key1']. The idea is that this object has all of the information needed to then apply some operation to each of the groups." - Python for Data Analysis

In [38]:
# The GroupBy object is lazy; wrap it in list() to see the (key, DataFrame) pairs

df.groupby("ZipCode")

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x10bfebc50>
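
Printing a GroupBy object only shows its memory address. A minimal sketch of how to look inside one, assuming a hypothetical four-row frame `demo` that reuses two of the notebook's zip codes:

```python
import pandas as pd

# Hypothetical four-row frame reusing two of the notebook's zip codes.
demo = pd.DataFrame({
    "ZipCode": [94085, 94085, 95051, 95051],
    "HomePriceK": [894, 861, 942, 991],
})

grouped = demo.groupby("ZipCode")  # no per-group results are computed yet

# Iterating yields (key, sub-DataFrame) pairs, one per group.
for zip_code, rows in grouped:
    print(zip_code, len(rows))

# list() materializes the same pairs so they can be inspected.
pairs = list(grouped)
```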

Descriptive statistics by group

In [104]:
# .groups maps each group key to the index labels of its rows
df.groupby("ZipCode").groups

{94085: Int64Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 16, 17,
             19, 20, 21, 22, 24, 25, 26, 28],
            dtype='int64'),
 94087: Int64Index([40, 41, 45, 47, 48, 53, 55, 56, 57, 59, 60, 61, 62, 63, 64, 66, 67,
             68, 69, 71, 74, 75, 77, 78, 79],
            dtype='int64'),
 95014: Int64Index([65, 70, 72, 73, 76, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91,
             92, 93, 94, 95, 96, 97, 98, 99],
            dtype='int64'),
 95051: Int64Index([15, 18, 23, 27, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 42, 43,
             44, 46, 49, 50, 51, 52, 54, 58],
            dtype='int64')}
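
To go from the index labels in `.groups` to the actual rows, `get_group` is one option. A sketch on a hypothetical four-row frame `demo`:

```python
import pandas as pd

# Hypothetical four-row frame with interleaved zip codes.
demo = pd.DataFrame({
    "ZipCode": [94085, 95051, 94085, 95051],
    "LotSize": [6056, 6680, 6085, 7181],
})

grouped = demo.groupby("ZipCode")

# .groups gives the row labels belonging to each key.
labels = list(grouped.groups[94085])

# get_group() returns the rows themselves for a single key.
rows_95051 = grouped.get_group(95051)

print(labels)
print(rows_95051)
```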

In [33]:
df.groupby("ZipCode").LotSize.describe()

ZipCode,count,mean,std,min,25%,50%,75%,max
94085,25.0,6531.48,366.680279,6056.0,6183.0,6514.0,6870.0,7098.0
94087,25.0,8279.68,467.047439,7426.0,7958.0,8348.0,8585.0,8974.0
95014,25.0,9145.28,275.174266,8446.0,9095.0,9211.0,9337.0,9476.0
95051,25.0,7405.56,359.942134,6680.0,7181.0,7339.0,7693.0,8096.0


In [34]:
df.groupby("ZipCode").SchoolAPI.describe()

ZipCode,count,mean,std,min,25%,50%,75%,max
94085,25.0,907.0,38.22303,851.0,877.0,904.0,935.0,966.0
94087,25.0,899.24,33.121343,850.0,876.0,890.0,927.0,962.0
95014,25.0,894.8,32.430695,850.0,862.0,889.0,924.0,942.0
95051,25.0,916.68,39.359158,853.0,891.0,918.0,949.0,975.0
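
`describe()` returns a fixed set of statistics. When only a few are needed, `agg` with a list of function names is one alternative. The sketch below assumes a frame `demo` (hypothetical name) holding the `SchoolAPI` values of the first five rows shown earlier, all of which are in zip code 94085.

```python
import pandas as pd

# SchoolAPI values from the first five rows of the notebook's data.
demo = pd.DataFrame({
    "ZipCode": [94085, 94085, 94085, 94085, 94085],
    "SchoolAPI": [899, 959, 865, 959, 877],
})

# agg() computes only the statistics that are asked for.
summary = demo.groupby("ZipCode")["SchoolAPI"].agg(["mean", "median", "max"])
print(summary)
```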


### Get Columns + Index

In [35]:
df.columns

Index(['HomeID', 'HomeAge', 'HomeSqft', 'LotSize', 'BedRooms', 'SchoolAPI',
       'CarGarage', 'ZipCode', 'HomePriceK', 'Price2019'],
      dtype='object')

In [36]:
list(df.columns)

['HomeID',
 'HomeAge',
 'HomeSqft',
 'LotSize',
 'BedRooms',
 'SchoolAPI',
 'CarGarage',
 'ZipCode',
 'HomePriceK',
 'Price2019']
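
The row labels can be inspected the same way through `df.index`. A minimal sketch on a hypothetical three-row frame `demo`, which also shows how `set_index` swaps the default labels for a column:

```python
import pandas as pd

# Hypothetical three-row frame with two of the notebook's columns.
demo = pd.DataFrame({"HomeID": [1, 2, 3], "HomeAge": [24, 10, 14]})

print(demo.index)        # the default index counts rows from 0
print(list(demo.index))

# set_index() returns a new frame that uses HomeID as the row labels.
by_id = demo.set_index("HomeID")
print(list(by_id.index))
```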

# <font color = "red">Pandas HW 1</font>

Could there be any relationship between "Price per lot size Sqft" and "Price per home Sqft"? What is the takeaway message from the data we have? Try out different functions to see whether any relationship stands out.

In [56]:
# Your code here ...
# Compute price per lot sqft and price per home sqft as new columns, then describe each by zip code
# (the * 1000 assumes HomePriceK is in thousands of dollars)

df["PricePerLot"] = (df["HomePriceK"] * 1000 / df["LotSize"]).round(2)
print("Price/Lot size sqft by Zipcodes:")
print(df.groupby("ZipCode").PricePerLot.describe())
print("\n")

df["PricePerHome"] = (df["HomePriceK"] * 1000 / df["HomeSqft"]).round(2)
print("Price/Home size sqft by Zipcodes:")
print(df.groupby("ZipCode").PricePerHome.describe())

Price/Lot size sqft by Zipcodes:
         count      mean       std     min     25%     50%     75%     max
ZipCode                                                                   
94085     25.0  135.8696  5.861388  125.97  131.59  132.95  141.02  147.62
94087     25.0  139.3828  6.342052  131.48  135.12  138.13  142.01  154.70
95014     25.0  138.1952  4.021741  130.34  134.78  139.18  141.96  143.09
95051     25.0  138.2388  4.039974  131.42  136.04  138.60  141.44  146.08


Price/Home size sqft by Zipcodes:
         count      mean         std     min     25%     50%     75%     max
ZipCode                                                                     
94085     25.0  558.0148   87.981158  432.97  497.82  550.86  618.30  746.93
94087     25.0  746.1456   98.434066  579.14  677.05  755.58  821.66  930.40
95014     25.0  800.9448  114.728785  629.89  723.96  784.62  882.92  978.66
95051     25.0  630.4940  104.803343  480.04  542.49  602.09  714.66  833.20
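
One way to probe the relationship between the two ratios directly is a correlation coefficient. The sketch below is an assumption about how one might start, not a full answer: it uses only the five rows shown by `df.head()` (all in zip code 94085), so the result says little about the full dataset.

```python
import pandas as pd

# First five rows of the notebook's data (all in zip code 94085).
demo = pd.DataFrame({
    "HomeSqft": [1757, 1563, 1344, 1215, 1866],
    "LotSize": [6056, 6085, 6089, 6129, 6141],
    "HomePriceK": [894, 861, 831, 809, 890],
})

demo["PricePerLot"] = (demo["HomePriceK"] * 1000 / demo["LotSize"]).round(2)
demo["PricePerHome"] = (demo["HomePriceK"] * 1000 / demo["HomeSqft"]).round(2)

# Pearson correlation: values near +1 or -1 suggest a strong linear relationship.
r = demo["PricePerLot"].corr(demo["PricePerHome"])
print(round(r, 3))
```

On the full `df`, the same call could be run per zip code to see whether the relationship differs by area.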


# <font color = "red">Pandas HW 2</font>

What other data column for the zip codes could make the analysis more precise?

Median household income, population, population density? Include one or more new data columns and revisit your conclusions from HW 1.

Are these home prices driven by factors for which we have the data?

In [102]:
# Your code here ...
# Look up the median household income for each zip code online, add it as a column, and rerun the analysis


df.loc[df.ZipCode==94087,"MedianIncome"]=129668
df.loc[df.ZipCode==95014,"MedianIncome"]=141917
df.loc[df.ZipCode==95051,"MedianIncome"]=106527
df.loc[df.ZipCode==94085,"MedianIncome"]=101051


df.groupby("MedianIncome").HomePriceK.describe()

MedianIncome,count,mean,std,min,25%,50%,75%,max
101051.0,25.0,885.96,34.408671,809.0,865.0,894.0,912.0,934.0
106527.0,25.0,1023.2,46.984927,942.0,991.0,1030.0,1068.0,1097.0
129668.0,25.0,1151.48,28.133788,1103.0,1128.0,1150.0,1179.0,1190.0
141917.0,25.0,1263.32,38.518091,1194.0,1240.0,1269.0,1288.0,1336.0
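
A more compact way to attach one value per zip code is a dictionary plus `map`, and the link between income and price can then be summarized with a correlation. This is a sketch under stated assumptions: `summary` and `median_income` are hypothetical names, the incomes are the ones entered above, and the mean prices are copied from the output table. With only four zip codes, any correlation is suggestive at best.

```python
import pandas as pd

# Mean HomePriceK per zip code, copied from the describe() output above.
summary = pd.DataFrame({
    "ZipCode": [94085, 95051, 94087, 95014],
    "MeanPriceK": [885.96, 1023.2, 1151.48, 1263.32],
})

# Median incomes as entered in the notebook cell above.
median_income = {94085: 101051, 95051: 106527, 94087: 129668, 95014: 141917}

# map() looks up each zip code in the dict, replacing four separate .loc assignments.
summary["MedianIncome"] = summary["ZipCode"].map(median_income)

# Pearson correlation between income and mean price across the four zip codes.
r = summary["MedianIncome"].corr(summary["MeanPriceK"])
print(round(r, 3))
```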
