# Data Cleansing
This notebook explores the raw data produced by our AWS Lambda data pipeline. The final output of that pipeline is a large number of JSON files written to S3, with data coming from Craigslist, MapQuest and WalkScore.com.

## Import Our Libraries

In [379]:
import boto3
import pandas as pd
import Config
from pprint import pprint
import json
import numpy as np
import re

## AWS Credentials

In [380]:
# Get Credentials
credentials = Config.read_credentials()
aws_secret = credentials['aws']['aws_secret_access_key']
aws_access_key = credentials['aws']['aws_access_key_id']

## Bring in the raw JSON data

In [381]:
# Get keys of files
s3 = boto3.client('s3',
                 aws_access_key_id=aws_access_key,
                 aws_secret_access_key=aws_secret)

bucket = 'lazyapartment'
objects = s3.list_objects_v2(Bucket=bucket, Prefix='walkScoreEnhancedData/')
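
A single `list_objects_v2` call returns at most 1,000 keys, so the listing above would be truncated if the prefix ever holds more files than that. Below is a minimal sketch of a paginated listing using the client's paginator. `list_all_keys` is a hypothetical helper name and is not part of the original notebook.

```python
def list_all_keys(client, bucket, prefix):
    """Collect every object key under `prefix`, following pagination."""
    keys = []
    paginator = client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # 'Contents' is absent from a page when nothing matches
        for obj in page.get('Contents', []):
            keys.append(obj['Key'])
    return keys
```

With the client created above, this could be called as `list_all_keys(s3, bucket, 'walkScoreEnhancedData/')`.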

## Combine all the raw JSON to Pandas for exploration

In [382]:
# Put all of the keys into a list, skipping the first entry
# (presumably the placeholder key for the prefix itself)
keys = [obj['Key'] for obj in objects['Contents']][1:]

# Put all the raw data into a pandas dataframe
data = []
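# Pass the same credentials used for the client above
s3 = boto3.resource('s3',
                    aws_access_key_id=aws_access_key,
                    aws_secret_access_key=aws_secret)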
for key in keys:
    bucket_object = s3.Object(bucket, key)
    contents = bucket_object.get()['Body'].read().decode('utf-8')
    json_data = json.loads(contents)

    for apartments in json_data:
        data.append(apartments)

df = pd.DataFrame(data)
# Lower all string features
df['name'] = df['name'].str.lower()
df['where'] = df['where'].str.lower()

# Export for easy access later
df = df.set_index('id')
df.to_csv('housing.csv')  # to_csv returns None, so do not reassign df

## Reload the exported DataFrame

In [383]:
df = pd.read_csv('housing.csv')

## Clean the data

### Separate out latitude and longitude, drop geotag column

In [384]:
# Slice out each coordinate and convert it to a float (float() tolerates surrounding spaces)
df['lat'] = df['geotag'].apply(lambda x: float(x[1:x.index(',')]))
df['lon'] = df['geotag'].apply(lambda x: float(x[x.rindex(',')+1:-1]))
df = df.drop(columns='geotag')
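
The slicing above implies that `geotag` is a string of the form `"(lat, lon)"`. Under that assumption, the same split can be sketched with a single vectorized regex. The sample values below are hypothetical and only illustrate the assumed format.

```python
import pandas as pd

# Hypothetical sample in the assumed "(lat, lon)" format
sample = pd.DataFrame({'geotag': ['(40.7128, -74.0060)', '(40.6782, -73.9442)']})

# Capture the two numbers between the parentheses and convert them to floats
coords = sample['geotag'].str.extract(
    r'\(\s*(-?[\d.]+)\s*,\s*(-?[\d.]+)\s*\)'
).astype(float)
sample['lat'] = coords[0]
sample['lon'] = coords[1]
```

Rows that do not match the pattern come back as NaN instead of raising, which can make malformed geotags easier to spot.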

### Clean up area
Unfortunately, many of the apartment listings don't include an area in square feet. Where square footage is present it arrives as a string, so here we remove "ft2" from the column and convert it to a numeric datatype. We can also make a feature for whether or not the posting includes the square footage.

In [385]:
def cleanUpArea(value):
    # Strings look like '750ft2'; missing values arrive as NaN and pass through
    if isinstance(value, str):
        return int(value.replace('ft2', ''))
    return value

df['area'] = df['area'].apply(cleanUpArea)
# 1 if the posting listed a square footage, 0 otherwise
df['includes_area'] = df['area'].notna().astype(int)
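
As an illustration of the cleanup described above, here is a small vectorized sketch on made-up values. The sample data is hypothetical; only the "ft2" suffix comes from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: strings with an "ft2" suffix and one missing value
area = pd.Series(['750ft2', np.nan, '1200ft2'])

# Remove the suffix; to_numeric keeps missing values as NaN
area_clean = pd.to_numeric(area.str.replace('ft2', '', regex=False))

# 1 where a square footage was listed, 0 where it was missing
includes_area = area_clean.notna().astype(int)
```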

### Clean up Price
Similarly, price is stored as a string because it is prefixed with a '$'.

In [386]:
df['price'] = df['price'].apply(lambda x: x.replace('$', '')).astype(int)

### Dates
Convert the datetime field to a pandas datetime and extract calendar features.

In [387]:
df['datetime'] = pd.to_datetime(df['datetime'], format='%Y-%m-%d %H:%M')
df['year'] = df['datetime'].dt.year
df['month'] = df['datetime'].dt.month
df['dow'] = df['datetime'].dt.dayofweek
df['day'] = df['datetime'].dt.day
df['hour'] = df['datetime'].dt.hour
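
To make the extracted features concrete, here is a short sketch on a single hypothetical timestamp in the same format. Note that pandas numbers `dayofweek` from Monday = 0 to Sunday = 6.

```python
import pandas as pd

# Hypothetical timestamp in the same '%Y-%m-%d %H:%M' format
ts = pd.to_datetime(pd.Series(['2018-01-01 09:30']), format='%Y-%m-%d %H:%M')

features = {
    'year': int(ts.dt.year.iloc[0]),
    'month': int(ts.dt.month.iloc[0]),
    'dow': int(ts.dt.dayofweek.iloc[0]),  # Monday is 0
    'day': int(ts.dt.day.iloc[0]),
    'hour': int(ts.dt.hour.iloc[0]),
}
```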

### Missing Bedrooms
For now we use a quick regex hack to fill in the bedroom count for listings that omit it. Once enough data is collected, this should be a fairly easy NLP problem: a 2 bedroom apartment is usually described in the title as "2BR", "2BD" or "2 Bedroom".

In [388]:
def parseBedrooms(row):
    # Keep the listed bedroom count when it is present
    if not np.isnan(row['bedrooms']):
        return row['bedrooms']
    # Otherwise look for a bedroom hint in the listing title
    if re.search(r'studio', row['name'], re.IGNORECASE):
        return 0
    elif re.search(r'1BD|1 BD|1BR|1 BR|1 bedroom|1bedroom|one bedroom', row['name'], re.IGNORECASE):
        return 1
    elif re.search(r'2BD|2 BD|2BR|2 BR|2 bedroom|2bedroom|two bedroom', row['name'], re.IGNORECASE):
        return 2
    elif re.search(r'3BD|3 BD|3BR|3 BR|3 bedroom|3bedroom|three bedroom', row['name'], re.IGNORECASE):
        return 3
    # No hint found: leave the value missing
    return np.nan

df['bedrooms_filled'] = df.apply(parseBedrooms, axis=1)
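
The per-count patterns above could be generalized with one capturing regex. The sketch below is a hypothetical alternative, not the notebook's method: `BEDROOM_PATTERN` and `extract_bedrooms` are illustrative names, and it only handles digits, not spelled-out counts such as "two bedroom".

```python
import re

# Hypothetical pattern: a number followed by a common bedroom abbreviation
BEDROOM_PATTERN = re.compile(r'(\d+)\s*(?:br|bd|bed(?:room)?s?)\b', re.IGNORECASE)

def extract_bedrooms(title):
    """Return the bedroom count hinted at in a listing title, or None."""
    # Treat a studio as zero bedrooms, as in the cell above
    if re.search(r'\bstudio\b', title, re.IGNORECASE):
        return 0
    match = BEDROOM_PATTERN.search(title)
    return int(match.group(1)) if match else None
```

Because the count is captured rather than enumerated, titles with four or more bedrooms are covered without adding new branches.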

### No Fee
Many apartments in New York come with an additional realtor's fee. Let's create a feature indicating whether or not "No Fee" is advertised in the title.

In [389]:
df['advertises_no_fee'] = df['name'].apply(lambda x: 1 if 'no fee' in x.lower() else 0)

### Convert booleans to 1/0's

In [390]:
df['has_image'] = df['has_image'].astype(int)
df['has_map'] = df['has_map'].astype(int)
df['sideOfStreetEncoded'] = df['sideOfStreet'].map({'L':0, 'R':1})

### One-hot encode location
We can one-hot encode the postal code, as this will play nicer with our algorithm of choice later. We use a chopped version of the postal codes (dropping everything from the hyphen onward) to prevent our matrix from becoming too sparse.

In [391]:
df['postalCodeChopped'] = df['postalCode'].apply(lambda x: x[0:x.index('-')] if '-' in x else x)

In [392]:
df = pd.concat([df, pd.get_dummies(df['postalCodeChopped'], prefix='postal')], axis=1)
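
To show what the chopping and encoding produce, here is a small sketch on hypothetical postal codes. The values are made up; the `postal` prefix matches the cell above.

```python
import pandas as pd

# Hypothetical postal codes, some with a suffix after the hyphen
codes = pd.Series(['10001-1234', '10001', '11211-5678'])

# Keep only the part before the hyphen
chopped = codes.str.split('-').str[0]

# One indicator column per distinct chopped code
dummies = pd.get_dummies(chopped, prefix='postal')
```

Here the two listings in `10001` share a single `postal_10001` column, which is the sparsity reduction the chopping is meant to achieve.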

## Export
Now that the data is (somewhat) cleaned, we can start exploring the relationships between the different features and our response, price. This is done in the Data Exploration - NYC Apartments notebook.