# Data Cleansing
This notebook explores our raw data after it has passed through the AWS Lambda data pipeline. The final output of that pipeline is a large number of JSON files dumped to S3, with data coming from Craigslist, Mapquest and WalkScore.com.

## Import Our Libraries

In [1]:
import boto3
import pandas as pd
from pprint import pprint
import json
import numpy as np
import re

## AWS Credentials

In [2]:
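# Get Credentials
# NOTE: `Config` is never imported in this notebook, which is why this cell
# raises the NameError shown below. It appears to be a project-local helper
# and must be imported before this cell will run.
credentials = Config.read_credentials()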
aws_secret = credentials['aws']['aws_secret_access_key']
aws_access_key = credentials['aws']['aws_access_key_id']

NameError: name 'Config' is not defined

## Bring in the raw JSON data

In [None]:
# Get keys of files
s3 = boto3.client('s3',
                 aws_access_key_id=aws_access_key,
                 aws_secret_access_key=aws_secret)

bucket = 'lazyapartment'
objects = s3.list_objects_v2(Bucket=bucket, Prefix='walkScoreEnhancedData/')
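
A single `list_objects_v2` call returns at most 1,000 keys, so a large dump may be truncated. The sketch below is one way to collect every key using boto3's built-in paginator; the helper name `list_all_keys` is hypothetical and not part of the original pipeline.

```python
def list_all_keys(s3_client, bucket, prefix):
    """Collect every object key under a prefix, following pagination.

    `s3_client` is assumed to be a boto3 S3 client such as the one created above.
    """
    paginator = s3_client.get_paginator('list_objects_v2')
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # 'Contents' is absent from a page that holds no objects
        for obj in page.get('Contents', []):
            keys.append(obj['Key'])
    return keys
```

With the client from this cell it could be called as `list_all_keys(s3, bucket, 'walkScoreEnhancedData/')`.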

## Combine all the raw JSON to Pandas for exploration

In [None]:
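# Put all of the keys into a list, skipping the first entry
keys = [obj['Key'] for obj in objects['Contents']][1:]

# Put all the raw data into a pandas dataframe
data = []
# Use a separate name for the resource so the client above is not shadowed,
# and pass the same credentials explicitly
s3_resource = boto3.resource('s3',
                             aws_access_key_id=aws_access_key,
                             aws_secret_access_key=aws_secret)
for key in keys:
    bucket_object = s3_resource.Object(bucket, key)
    contents = bucket_object.get()['Body'].read().decode('utf-8')
    json_data = json.loads(contents)

    # Each file holds a list of apartment records
    data.extend(json_data)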

df = pd.DataFrame(data)
# Lower all string features
df['name'] = df['name'].str.lower()
df['where'] = df['where'].str.lower()

## Export the DF for easy access later

In [None]:
# # Export for easy access later
# # (to_csv returns None when given a path, so do not assign its result to df)
# df = df.set_index('id')
# df.to_csv('housing.csv')
df = pd.read_csv('housing.csv')

## Clean the data

### Separate out latitude and longitude, drop geotag column

In [None]:
# geotag is a string of the form "(lat, lon)"; slice out each half and cast to float
df['lat'] = df['geotag'].apply(lambda x: x[1:x.index(',')]).astype(float)
df['lon'] = df['geotag'].apply(lambda x: x[x.rindex(',')+1:-1]).astype(float)
df = df.drop(columns='geotag')

### Clean up area
Unfortunately, many of the apartment listings don't include an area in square feet. Where square footage is present it comes in as a string, so here we remove "ft2" from the column and convert it to a numeric datatype. We also make a feature for whether or not the posting includes the square footage.

In [None]:
def cleanUpArea(value):
    # Strings carry a trailing 'ft2'; missing areas are NaN and pass through unchanged
    if isinstance(value, str):
        return int(value.replace('ft2', ''))
    return value

df['area'] = df['area'].apply(cleanUpArea)
# 1 if the posting lists square footage, 0 otherwise
df['includes_area'] = df['area'].notna().astype(int)

### Clean up Price
Similarly, price comes in as a string because it is prefixed with a '$'.

In [None]:
df['price'] = df['price'].apply(lambda x: x.replace('$', '')).astype(int)

### Dates
Convert the `datetime` field to a datetime type and extract calendar features from it.

In [None]:
df['datetime'] = pd.to_datetime(df['datetime'], format='%Y-%m-%d %H:%M')
df['year'] = df['datetime'].dt.year
df['month'] = df['datetime'].dt.month
df['dow'] = df['datetime'].dt.dayofweek
df['day'] = df['datetime'].dt.day
df['hour'] = df['datetime'].dt.hour

### Missing Bedrooms
For whatever reason, Craigslist often labels studios as having 1 bedroom, or leaves the bedroom count missing. For now we will lazily assign 0 bedrooms to apartments that have the word "studio" in the title; otherwise we use the value provided, falling back to the title when it is missing.

In [None]:
def parseBedrooms(row):
    # Any listing that mentions "studio" in the title counts as 0 bedrooms
    if re.search(r'studio', row['name'], re.IGNORECASE):
        return 0
    # Bedroom count is missing: try to recover it from the title
    if np.isnan(row['bedrooms']):
        if re.search(r'1br|1bedroom|1bd|1 bedroom', row['name'], re.IGNORECASE):
            return 1
        elif re.search(r'2br|2bedroom|2bd|2 bedroom', row['name'], re.IGNORECASE):
            return 2
        elif re.search(r'3br|3bedroom|3bd|3 bedroom', row['name'], re.IGNORECASE):
            return 3
        # Nothing recognizable in the title: leave the value missing
        return np.nan
    return row['bedrooms']

df['bedrooms'] = df.apply(parseBedrooms, axis=1)
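
The title lookup can also be written with a single capture group instead of one pattern per bedroom count. This is only a sketch: the pattern and the helper name `bedrooms_from_title` are assumptions for illustration, and it handles single-digit counts only.

```python
import re

# Hypothetical pattern: one digit, optional whitespace, then a bedroom abbreviation.
# 'bed' also covers 'bedroom' and 'bedrooms'.
BEDROOM_PATTERN = re.compile(r'(\d)\s*(?:br|bd|bed)', re.IGNORECASE)

def bedrooms_from_title(title):
    """Return the bedroom count implied by a listing title, or None if unclear."""
    if re.search(r'studio', title, re.IGNORECASE):
        return 0
    match = BEDROOM_PATTERN.search(title)
    return int(match.group(1)) if match else None

# Made-up titles for illustration
for title in ['Sunny 2BR in Astoria', 'cozy studio near park', '3 bedroom walk-up', 'great loft']:
    print(title, '->', bedrooms_from_title(title))
```

A title with no recognizable count returns `None`, so the caller decides how to treat it.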

### No Fee
Many apartments in New York have an additional realtor's fee attached. Let's create a feature for whether or not "No Fee" is advertised in the title.

In [None]:
df['advertises_no_fee'] = df['name'].apply(lambda x: 1 if 'no fee' in x.lower() else 0)

### Repost
Many apartments are reposted on Craigslist if they are not sold, as this puts them back at the top of the list for people who sort by "Newest to Oldest". Let's create a feature for whether or not a listing is a repost.

In [None]:
df['is_repost'] = df['repost_of'].apply(lambda x: 1 if not np.isnan(x) else 0)

### Convert booleans to 1s and 0s

In [None]:
df['has_image'] = df['has_image'].astype(int)
df['has_map'] = df['has_map'].astype(int)
df['sideOfStreetEncoded'] = df['sideOfStreet'].map({'L':0, 'R':1})

### One hot encode location
We can one-hot encode the postal code, as this will play nicer with our algorithm of choice later. We will use a chopped version of the postal codes (only the part before the hyphen) to prevent our matrix from becoming too sparse.

In [None]:
df['postalCodeChopped'] = df['postalCode'].astype(str).apply(lambda x: x[0:x.index('-')] if '-' in x else x)
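
The cell above only produces the chopped codes. The encoding step itself could look like the sketch below, which uses `pd.get_dummies` on a toy frame; the sample ZIP values and the `zip` prefix are made up for illustration.

```python
import pandas as pd

# Toy frame standing in for the real data; the postal codes are made up
sample = pd.DataFrame({'postalCode': ['10001-1234', '10001', '11201-0042']})

# Same chopping rule as above: keep only the part before the hyphen
sample['postalCodeChopped'] = sample['postalCode'].astype(str).str.split('-').str[0]

# One indicator column per distinct chopped code
dummies = pd.get_dummies(sample['postalCodeChopped'], prefix='zip', dtype=int)
encoded = pd.concat([sample, dummies], axis=1)
print(encoded)
```

Chopping first means `10001-1234` and `10001` share one indicator column rather than getting one each.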

## Remove Price Outliers
There are a couple of apartments that are way out there for prices (it is NYC after all). This is an exploration for the common man, so let's remove absurdly expensive apartments.

In [None]:
price_std = df['price'].std()
df = df[df['price'] < (df['price'].mean() + 3*price_std)]
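
To see what the mean-plus-three-standard-deviations rule does, here is a small standard-library sketch on made-up prices. Extreme values inflate both the mean and the standard deviation, so the rule only trims the far tail.

```python
import statistics

# Made-up monthly prices: nineteen typical listings and one extreme one
prices = [2500] * 19 + [50000]

mean = statistics.mean(prices)
std = statistics.stdev(prices)  # sample standard deviation, like pandas' default .std()
cutoff = mean + 3 * std

# Keep only listings strictly below the cutoff, mirroring the filter above
kept = [p for p in prices if p < cutoff]
print(f'cutoff = {cutoff:.0f}, kept {len(kept)} of {len(prices)} listings')
```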

## Export
Now that the data is cleaned (somewhat), we can start exploring the relationships among the different features and our response, price. This is done in the Data Exploration - NYC Apartments notebook.

In [None]:
df = df.set_index('id')
df.to_csv('housing_cleaned.csv')