# Assignment 1

## What is the code doing?

This code examines the provided `rents.csv` dataset. First we read the dataset and rename the columns we reference most often to more descriptive titles. Then we replace all the `NA` values with 0s and clean up the columns that contain non-numeric characters so that they can be coerced to floats. After coercing the data to floats, we compute summary statistics (mean, min, and max) for the columns of interest.

Then we make a subset of the data that includes all properties with rents between 1000 and 4000 (inclusive), excluding those with missing values for either number of bedrooms or total square footage. We then compute the same summary statistics on this subset.

## My Development Process

Cleaning and subsetting the data was fairly straightforward. Developing a function to compute the summary statistics for each column was more challenging and required reading the documentation for the `apply` method of pandas DataFrames. Once I figured out how to use `apply` to run a function on all the columns I wanted at once, I used it everywhere I had initially written three very similar lines of code. Once I worked out how to create a DataFrame to hold my summary statistics and fill it with the correct values, I abstracted this into a function, `summary_stats`.
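
As a minimal sketch of the pattern described above, the snippet below uses made-up numbers (not values from `rents.csv`) to show `DataFrame.apply` computing one statistic per column, and `DataFrame.agg` as one possible shortcut for computing several statistics at once. The `toy` frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data, not taken from rents.csv
toy = pd.DataFrame({
    "rent_dollars": [1000.0, 2000.0, 3000.0],
    "no_bedrooms": [1.0, 2.0, 3.0],
})

# apply calls the function once per column and returns a Series
# indexed by column name
col_means = toy.apply(lambda c: c.mean())

# agg with a list of names computes several statistics at once;
# transposing gives one row per column and one column per statistic
toy_stats = toy.agg(["mean", "min", "max"]).T
```

Here `toy_stats` has the same layout as the frame returned by `summary_stats`: one row per data column, with `mean`, `min`, and `max` as columns.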

In [1]:
# import necessary packages
import numpy as np
import pandas as pd

In [2]:
# import csv
rents = pd.read_csv('assignment-01/rents.csv') # path relative to notebook file

# take a look at this csv 
rents.head()

Unnamed: 0,city,rent,br,sqft
0,Boston,$675.00,1.0,560 ft2
1,Boston,$772.00,1.0,608 ft2
2,Boston,$789.00,1.0,618 ft2
3,Boston,$795.00,1.0,622 ft2
4,Boston,$800.00,1.0,629 ft2


In [3]:
# rename the columns
rents = rents.rename(columns={'rent':'rent_dollars', 'br':'no_bedrooms', 'sqft':'squarefeet'})

# make a list of the column names that will be referenced multiple times
key_cols = ['rent_dollars', 'no_bedrooms', 'squarefeet']

In [4]:
# fill NA values with 0s across all key columns
rents[key_cols] = rents[key_cols].apply(lambda d: d.fillna(0))

In [5]:
# clean up the rent column so that there are no longer dollar signs
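# regex=False treats '$' as a literal character rather than a regex end-of-string anchor
rents['rent_dollars'] = rents['rent_dollars'].str.replace('$', '', regex=False).str.strip()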

In [6]:
# clean up the square feet column so there is no longer the unit reference
rents['squarefeet'] = rents['squarefeet'].str.replace(" ft2", '').str.strip()
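
One thing that may be worth checking at this point: on an object column, the `.str` accessor returns `NaN` for entries that are not strings, so zeros filled in before the string cleanup can turn back into missing values. The sketch below uses a made-up Series (an assumption, not data from `rents.csv`) to show the effect and one possible ordering that avoids it: clean, convert, then fill.

```python
import numpy as np
import pandas as pd

# Hypothetical column: two rent strings and one missing value
raw = pd.Series(["$675.00", np.nan, "$800.00"], dtype=object)

# Fill first, then strip: the integer 0 is not a string,
# so the .str method returns NaN for that entry
filled_first = raw.fillna(0).str.replace("$", "", regex=False).astype(float)

# Strip and convert first, then fill: the missing value ends up as 0.0
filled_last = pd.to_numeric(raw.str.replace("$", "", regex=False)).fillna(0)
```

Whether zeros or `NaN` are preferable depends on the analysis: `mean`, `min`, and `max` skip `NaN` by default, while zeros are included in all three.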

In [7]:
# change type of all key columns to float
rents[key_cols] = rents[key_cols].apply(lambda d: d.astype(float))

In [8]:
# write a function to calculate stats for given data frame and given columns
# Dataframe, Iterable [column names] -> Dataframe
def summary_stats(df, keys):
    # names of the statistics to compute (these become the columns of the result)
    columns = ["mean", "min", "max"]

    # create empty data frame with one row per key column and one column per statistic
    stats = pd.DataFrame(index=keys, columns=columns)

    # fill each stats column with the corresponding statistic for the given columns of df
    stats["mean"] = df[keys].apply(lambda c: c.mean())
    stats["min"] = df[keys].apply(lambda c: c.min())
    stats["max"] = df[keys].apply(lambda c: c.max())

    return stats



# compute the summary stats for the full dataset and display them
stats = summary_stats(rents, key_cols)
stats

column,mean,min,max
rent_dollars,3364.094,675.0,11510.0
no_bedrooms,2.458,0.0,4.0
squarefeet,1213.11066,560.0,1893.0


In [9]:
# create subset that contains rents between 1000 and 4000 
# as long as they have valid values for number of bedrooms and square footage
subset = rents[(rents['rent_dollars'] >= 1000) & (rents['rent_dollars'] <= 4000) 
      & (rents['squarefeet'] > 0) & (rents['no_bedrooms'] > 0)]
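
The same rent filter can also be written with `Series.between`, which is inclusive on both ends by default. This is a sketch on a made-up frame (the values are assumptions for illustration), not a replacement for the cell above.

```python
import pandas as pd

# Hypothetical toy data, not taken from rents.csv
toy = pd.DataFrame({
    "rent_dollars": [900.0, 1000.0, 2500.0, 4000.0, 4500.0],
    "no_bedrooms": [1.0, 0.0, 2.0, 3.0, 2.0],
    "squarefeet": [500.0, 600.0, 0.0, 1200.0, 1500.0],
})

# rent within [1000, 4000], both bounds included
in_range = toy["rent_dollars"].between(1000, 4000)

# rows where bedrooms and square footage were not filled with 0
has_size = (toy["squarefeet"] > 0) & (toy["no_bedrooms"] > 0)

toy_subset = toy[in_range & has_size]
```

Naming the two masks separately keeps the rent condition and the validity condition readable on their own.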

In [10]:
# compute the same summary stats for the subset
subset_stats = summary_stats(subset, key_cols)
subset_stats

column,mean,min,max
rent_dollars,2538.721649,1012.0,3995.0
no_bedrooms,1.977909,1.0,3.0
squarefeet,1117.718704,724.0,1337.0
