## Download the Data

You could use your web browser to download the file and run `tar xzf housing.tgz` to decompress it and extract the CSV file, but it is preferable to write a small function that does this. A download function is especially useful if the data changes regularly: you can write a small script that calls it to fetch the latest data, or set up a scheduled job to do so automatically at regular intervals.

In [47]:
import os
import tarfile
import urllib.request  # importing urllib alone does not guarantee urllib.request is loaded


DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    # The context manager closes the archive even if extraction fails
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)

Now when you call `fetch_housing_data()`, it creates a `datasets/housing` directory in your workspace, downloads the `housing.tgz` file, and extracts `housing.csv` from it into this directory.

In [48]:
fetch_housing_data()
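Re-running the cell above downloads the archive again every time. As a minimal sketch, assuming you sometimes want to reuse a copy already on disk, the variant below skips the download when the archive exists and takes a `force` flag for the scheduled-refresh case. The name `fetch_if_missing` and the `force` and `retrieve` parameters are hypothetical and not part of the original notebook.

```python
import os
import tarfile
import urllib.request


def fetch_if_missing(url, dest_dir, archive_name="housing.tgz",
                     force=False, retrieve=urllib.request.urlretrieve):
    """Download the archive unless it is already on disk, then extract it.

    Returns True if a download was performed, False if the local copy was reused.
    `retrieve` is injectable so the function can be exercised without network access.
    """
    os.makedirs(dest_dir, exist_ok=True)
    archive_path = os.path.join(dest_dir, archive_name)

    downloaded = False
    if force or not os.path.exists(archive_path):
        retrieve(url, archive_path)
        downloaded = True

    # Extract (or re-extract) the archive contents into dest_dir
    with tarfile.open(archive_path) as archive:
        archive.extractall(path=dest_dir)
    return downloaded
```

With the notebook's constants this would be called as `fetch_if_missing(HOUSING_URL, HOUSING_PATH)`, and as `fetch_if_missing(HOUSING_URL, HOUSING_PATH, force=True)` from a job that must always pull the latest data.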

### Load the Data

Load the dataset, then reorder the DataFrame columns so that the target variable (`median_house_value`) comes last.

In [49]:
import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

In [50]:
housing = load_housing_data()


In [51]:
columns = ['longitude', 'latitude', 'housing_median_age', 'total_rooms',
           'total_bedrooms', 'population', 'households', 'median_income',
           'ocean_proximity', 'median_house_value']

In [52]:
housing = housing[columns]

In [53]:
housing.head(5)

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | ocean_proximity | median_house_value |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | NEAR BAY | 452600.0 |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | NEAR BAY | 358500.0 |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | NEAR BAY | 352100.0 |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | NEAR BAY | 341300.0 |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | NEAR BAY | 342200.0 |
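Listing every column by hand works, but it has to be updated whenever the dataset gains or loses a column. As a small sketch, the reordering can also be written generically; the helper name `move_column_to_end` is hypothetical and not part of the original notebook.

```python
import pandas as pd


def move_column_to_end(df, target):
    """Return a copy of df with the `target` column moved to the last position."""
    if target not in df.columns:
        raise KeyError(f"column {target!r} not found")
    # Keep the remaining columns in their existing order, then append the target
    ordered = [col for col in df.columns if col != target] + [target]
    return df[ordered]
```

For the housing data this would be `housing = move_column_to_end(housing, "median_house_value")`, which should give the same column order as the explicit list above, assuming the CSV keeps its current columns.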


### Random Split Train-Test

In [54]:
import numpy as np
rand_split = np.random.rand(len(housing))  # one uniform draw in [0, 1) per row
train_list = rand_split < 0.8              # roughly 80% of the rows
test_list = ~train_list                    # the remaining rows (roughly 20%)

train_data = housing[train_list]
test_data = housing[test_list]

In [55]:
train_data.to_csv('train_data_without_header.csv', header=False, index=False)
test_data.to_csv('test_data_without_header.csv', header=False, index=False)
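The mask-based split above is unseeded, so every run produces a different train and test set, and the proportions are only approximately 80/20. As a sketch of one alternative (a swapped-in technique, not what the notebook does), a seeded permutation gives an exact ratio and the same split on every run. The function name `split_train_test` and the default `seed=42` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def split_train_test(df, test_ratio=0.2, seed=42):
    """Split df into (train, test) using a seeded random permutation of row positions."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(df))       # row positions in random order
    test_size = int(len(df) * test_ratio)     # exact number of test rows
    test_idx = shuffled[:test_size]
    train_idx = shuffled[test_size:]
    return df.iloc[train_idx], df.iloc[test_idx]
```

It would be used as `train_data, test_data = split_train_test(housing)`. A fixed seed keeps the split stable only as long as the dataset itself does not change; if rows are added or removed, the assignment of rows to each set changes too.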

### Upload to S3 bucket

Upload the train and test data to S3.

In [56]:
import sagemaker

files = ['train_data_without_header.csv', 'test_data_without_header.csv']
session = sagemaker.Session()

for file in files:
    url = session.upload_data(
        path=file,  # local path of the file to upload
        bucket="ta-sagemaker-experiments",
        key_prefix="housing/input-datasets"
    )

# Prints only the S3 URI of the last file uploaded
print(url)

s3://ta-sagemaker-experiments/housing/input-datasets/test_data_without_header.csv
