## UFO Sightings using K-Means Clustering

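The goal of this notebook is to use the K-Means algorithm to work out where we should build sighting facilities (sensors): we cluster the reported sighting coordinates and treat each cluster centroid as a candidate location.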

1. [Load dataset onto Notebook instance from S3](#Step-1:-Loading-the-data-from-Amazon-S3)
2. [Cleaning, transforming, and preparing the data](#Step-2:-Cleaning,-transforming,-and-preparing-the-data)
3. [Create and train our model](#Step-3:-Create-and-train-our-model)
4. [Viewing the results](#Step-4:-Viewing-the-results)
5. [Visualize using QuickSight](https://docs.aws.amazon.com/quicksight/latest/user/create-a-data-set-s3.html)

In [1]:
#Import all the needed libraries.
import pandas as pd
import numpy as np
from datetime import datetime

import boto3
from sagemaker import get_execution_role
import sagemaker.amazon.common as smac

## Step 1: Loading the data from Amazon S3


In [4]:
#Prerequisites: create an S3 bucket, upload the files, and set up an IAM role that lets SageMaker access the bucket
role = get_execution_role()
bucket = 'sagemaker-ml-lab-tx'
prefix = 'ufo_dataset'
data_key = 'ufo_fullset.csv'
data_location = 's3://{}/{}/{}'.format(bucket, prefix, data_key)
print(data_location)

s3://sagemaker-ml-lab-tx/ufo_dataset/ufo_fullset.csv


In [5]:
df = pd.read_csv(data_location, low_memory=False)
df.head()

| | reportedTimestamp | eventDate | eventTime | shape | duration | witnesses | weather | firstName | lastName | latitude | longitude | sighting | physicalEvidence | contact | researchOutcome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1977-04-04T04:02:23.340Z | 1977-03-31 | 23:46 | circle | 4 | 1 | rain | Ila | Bashirian | 47.329444 | -122.578889 | Y | N | N | explained |
| 1 | 1982-11-22T02:06:32.019Z | 1982-11-15 | 22:04 | disk | 4 | 1 | partly cloudy | Eriberto | Runolfsson | 52.664913 | -1.034894 | Y | Y | N | explained |
| 2 | 1992-12-07T19:06:52.482Z | 1992-12-07 | 19:01 | circle | 49 | 1 | clear | Miller | Watsica | 38.951667 | -92.333889 | Y | N | N | explained |
| 3 | 2011-02-24T21:06:34.898Z | 2011-02-21 | 20:56 | disk | 13 | 1 | partly cloudy | Clifton | Bechtelar | 41.496944 | -71.367778 | Y | N | N | explained |
| 4 | 1991-03-09T16:18:45.501Z | 1991-03-09 | 11:42 | circle | 17 | 1 | mostly cloudy | Jayda | Ebert | 47.606389 | -122.330833 | Y | N | N | explained |


In [6]:
df.describe()

| | duration | witnesses | latitude | longitude |
|---|---|---|---|---|
| count | 18000.0 | 18000.0 | 18000.0 | 18000.0 |
| mean | 50.516722 | 5.512889 | 37.768656 | -87.034487 |
| std | 28.689429 | 2.882893 | 10.074287 | 38.837258 |
| min | 1.0 | 1.0 | -46.163992 | -169.541667 |
| 25% | 26.0 | 3.0 | 33.990833 | -111.840556 |
| 50% | 51.0 | 5.0 | 38.901111 | -87.841806 |
| 75% | 75.0 | 8.0 | 42.271111 | -79.121667 |
| max | 100.0 | 10.0 | 70.416667 | 178.4419 |


In [7]:
df.shape

(18000, 15)

## Step 2: Cleaning, transforming, and preparing the data


In [8]:
#Create DataFrame with just the latitude and longitude attributes
df_geo = df[['latitude', 'longitude']]

In [9]:
df_geo.head()

| | latitude | longitude |
|---|---|---|
| 0 | 47.329444 | -122.578889 |
| 1 | 52.664913 | -1.034894 |
| 2 | 38.951667 | -92.333889 |
| 3 | 41.496944 | -71.367778 |
| 4 | 47.606389 | -122.330833 |


In [10]:
df_geo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18000 entries, 0 to 17999
Data columns (total 2 columns):
latitude     18000 non-null float64
longitude    18000 non-null float64
dtypes: float64(2)
memory usage: 281.3 KB


In [11]:
missing_values = df_geo.isnull().values.any()
print('Are there any missing values? {}'.format(missing_values))
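if missing_values:
    #print the rows that contain missing values (a bare expression inside an if block is not displayed)
    print(df_geo[df_geo.isnull().any(axis=1)])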

Are there any missing values? False
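This dataset has no missing coordinates, so nothing needs to be dropped here. If a similar dataset did contain gaps or impossible values, one possible cleaning step could look like the sketch below. The helper name `clean_coordinates` and the toy rows are hypothetical and are not part of the original notebook.

```python
import numpy as np
import pandas as pd


def clean_coordinates(frame):
    """Drop rows whose latitude/longitude is missing or outside the valid range."""
    cleaned = frame.dropna(subset=["latitude", "longitude"])
    in_range = (
        cleaned["latitude"].between(-90, 90)
        & cleaned["longitude"].between(-180, 180)
    )
    return cleaned[in_range].reset_index(drop=True)


# Toy rows (hypothetical, not taken from the UFO dataset):
# one missing latitude and one impossible latitude of 95 degrees.
toy = pd.DataFrame({
    "latitude": [47.3, np.nan, 95.0, 38.9],
    "longitude": [-122.5, -1.0, 10.0, -92.3],
})
toy_clean = clean_coordinates(toy)
print(toy_clean)
```

Dropping rows is only one option; imputing a value would make little sense for coordinates, which is why the sketch removes the rows instead.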


In [12]:
#transform the pandas DataFrame (dataset) into a numpy.ndarray.
data_train = df_geo.values.astype('float32')
data_train

array([[  47.329445, -122.57889 ],
       [  52.664913,   -1.034894],
       [  38.951668,  -92.333885],
       ...,
       [  36.86639 ,  -83.888885],
       [  35.385834,  -94.39833 ],
       [  29.883055,  -97.94111 ]], dtype=float32)
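The array above shows 47.329445 where the DataFrame held 47.329444. That small change comes from the cast to `float32`, which keeps roughly seven significant decimal digits. The sketch below measures the rounding error for that one value; the variable names are illustrative only.

```python
import numpy as np

# First latitude from df.head(), stored as a 64-bit float by pandas.
lat64 = np.float64(47.329444)

# The same value after the cast used for data_train.
lat32 = np.float32(lat64)

# Absolute rounding error introduced by the cast, in degrees.
rounding_error = abs(float(lat32) - float(lat64))
print(float(lat32), rounding_error)
```

For coordinates of this magnitude the error stays in the range of a few millionths of a degree, which is far below the scale at which the clusters differ.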

## Step 3: Create and train our model


In [13]:
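#import and use the built-in SageMaker K-Means algorithm
#set the number of clusters to 10 (one per sensor)
#write the model artifact to an S3 output location
#note: SageMaker Python SDK v2 renames train_instance_count/train_instance_type to instance_count/instance_type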

from sagemaker import KMeans

num_clusters = 10
output_location = 's3://' + bucket + '/model-artifacts'

kmeans = KMeans(role=role,
               train_instance_count=1,
               train_instance_type='ml.c4.xlarge',
               output_path=output_location,
               k=num_clusters)
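To make the role of `k` concrete, here is a minimal NumPy sketch of textbook Lloyd's algorithm on a toy point set. It is not the SageMaker implementation, which runs its own training procedure on the managed instance; the function names `lloyd_kmeans` and `assign` and the toy points are assumptions for illustration. Like the notebook, it uses plain Euclidean distance, so on real latitude/longitude data it treats degrees as flat coordinates.

```python
import numpy as np


def assign(points, centroids):
    """Index of the nearest centroid (Euclidean distance) for every point."""
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)


def lloyd_kmeans(points, k, n_iter=20, seed=0):
    """Textbook Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = assign(points, centroids)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids, assign(points, centroids)


# Two well-separated toy groups (hypothetical data).
toy_points = np.array(
    [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float
)
toy_centroids, toy_labels = lloyd_kmeans(toy_points, k=2)
print(toy_centroids)
print(toy_labels)
```

Each centroid ends up at the mean of its group, which is the same idea behind using the ten centroids as sensor locations.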

In [14]:
job_name = 'kmeans-geo-job-{}'.format(datetime.now().strftime("%Y%m%d%H%M%S"))
print('Here is the job name {}'.format(job_name))

Here is the job name kmeans-geo-job-20200405090707
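Training job names have to be unique, which is why the timestamp is appended. As a hedged sketch, the helper below also checks the name against the character and length constraints as I understand them from the CreateTrainingJob API (letters, digits, and hyphens, at most 63 characters); treat the pattern as an assumption and confirm it against the current API reference. `make_job_name` is a hypothetical helper, not part of the notebook.

```python
import re
from datetime import datetime

# Assumed constraint: alphanumeric characters and hyphens, 1 to 63 characters,
# starting and ending with an alphanumeric character.
JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$")


def make_job_name(prefix, now=None):
    """Build '<prefix>-<YYYYmmddHHMMSS>' and validate it before use."""
    now = now or datetime.now()
    name = "{}-{}".format(prefix, now.strftime("%Y%m%d%H%M%S"))
    if not JOB_NAME_PATTERN.match(name):
        raise ValueError("invalid training job name: {}".format(name))
    return name


# Reproduces the name printed above from a fixed timestamp.
example_name = make_job_name("kmeans-geo-job", datetime(2020, 4, 5, 9, 7, 7))
print(example_name)
```

Validating locally gives a clearer error than waiting for the training request to be rejected, for example when a prefix contains an underscore.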


In [16]:
%%time
kmeans.fit(kmeans.record_set(data_train), job_name=job_name)

2020-04-05 09:07:37 Starting - Starting the training job...
2020-04-05 09:07:38 Starting - Launching requested ML instances......
2020-04-05 09:08:45 Starting - Preparing the instances for training.........
2020-04-05 09:10:22 Downloading - Downloading input data
2020-04-05 09:10:22 Training - Downloading the training image.
Docker entrypoint called with argument(s): train
Running default environment configuration script
[04/05/2020 09:10:46 INFO 140178056517440] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-input.json: {u'_enable_profiler': u'false', u'_tuning_objective_metric': u'', u'_num_gpus': u'auto', u'local_lloyd_num_trials': u'auto', u'_log_level': u'info', u'_kvstore': u'auto', u'local_lloyd_init_method': u'kmeans++', u'force_dense': u'true', u'epochs': u'1', u'init_method': u'random', u'local_lloyd_tol': u'0.0001', u'local_lloyd_max_iter': u'300', u'_disable_wait_to_read': u'false', u'extra_center


2020-04-05 09:10:59 Uploading - Uploading generated training model
2020-04-05 09:10:59 Completed - Training job completed
Training seconds: 49
Billable seconds: 49
CPU times: user 850 ms, sys: 37.9 ms, total: 888 ms
Wall time: 3min 42s


## Step 4: Viewing the results


In [17]:
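#download the model artifact from S3 and unpack it
import os
model_key = 'model-artifacts/' + job_name + '/output/model.tar.gz'

boto3.resource('s3').Bucket(bucket).download_file(model_key, 'model.tar.gz')
os.system('tar -zxvf model.tar.gz')
#os.system returns the command's wait status, so a non-zero value means unzip reported an error;
#the extracted model_algo-1 file is loaded directly with MXNet further down
os.system('unzip model_algo-1')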

2304
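Shelling out with `os.system` hides which files were extracted and only reports a status code. As an alternative sketch, the standard-library `tarfile` module can unpack the archive and return the member names. The demo below builds a stand-in archive with placeholder bytes so it runs without S3 access; `extract_archive` is a hypothetical helper and the archive contents are not a real model.

```python
import os
import tarfile
import tempfile


def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive and return the names of its members."""
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(dest_dir)
    return names


# Self-contained demo with a stand-in archive (placeholder contents).
with tempfile.TemporaryDirectory() as workdir:
    payload = os.path.join(workdir, "model_algo-1")
    with open(payload, "wb") as handle:
        handle.write(b"placeholder bytes")

    archive_path = os.path.join(workdir, "model.tar.gz")
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.add(payload, arcname="model_algo-1")

    out_dir = os.path.join(workdir, "extracted")
    os.makedirs(out_dir)
    extracted_names = extract_archive(archive_path, out_dir)
    extracted_ok = os.path.exists(os.path.join(out_dir, "model_algo-1"))

print(extracted_names, extracted_ok)
```

In the notebook, the same helper could be pointed at the downloaded `model.tar.gz` in place of the `tar` command.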

In [19]:
!pip install mxnet

Collecting mxnet
  Downloading https://files.pythonhosted.org/packages/81/f5/d79b5b40735086ff1100c680703e0f3efc830fa455e268e9e96f3c857e93/mxnet-1.6.0-py2.py3-none-any.whl (68.7MB)
    100% |████████████████████████████████| 68.7MB 581kB/s eta 0:00:01
Collecting graphviz<0.9.0,>=0.8.1 (from mxnet)
  Downloading https://files.pythonhosted.org/packages/53/39/4ab213673844e0c004bed8a0781a0721a3f6bb23eb8854ee75c236428892/graphviz-0.8.4-py2.py3-none-any.whl
Collecting numpy<2.0.0,>1.16.0 (from mxnet)
  Downloading https://files.pythonhosted.org/packages/07/08/a549ba8b061005bb629b76adc000f3caaaf881028b963c2e18f811c6edc1/numpy-1.18.2-cp36-cp36m-manylinux1_x86_64.whl (20.2MB)
    100% |████████████████████████████████| 20.2MB 2.2MB/s eta 0:00:01
Installing collected packages: graphviz, numpy, mxnet
  Found existing installation: numpy 1.14.3
    Uninstalling numpy-1.14.3:
      Successfully uninstalled numpy-1.14.3
Successfully installed graphviz-0.8.4 mxnet-1.6.0 numpy-1

In [20]:
!pip install --upgrade pip

Collecting pip
  Using cached https://files.pythonhosted.org/packages/54/0c/d01aa759fdc501a58f431eb594a17495f15b88da142ce14b5845662c13f3/pip-20.0.2-py2.py3-none-any.whl
Installing collected packages: pip
  Found existing installation: pip 10.0.1
    Uninstalling pip-10.0.1:
      Successfully uninstalled pip-10.0.1
Successfully installed pip-20.0.2


In [21]:
#load the model parameters from the extracted file as MXNet NDArrays
import mxnet as mx
Kmeans_model_params = mx.ndarray.load('model_algo-1')

In [23]:
cluster_centroids_kmeans = pd.DataFrame(Kmeans_model_params[0].asnumpy())
cluster_centroids_kmeans.columns=df_geo.columns
cluster_centroids_kmeans

| | latitude | longitude |
|---|---|---|
| 0 | 30.821253 | -81.567757 |
| 1 | -16.245201 | 135.893524 |
| 2 | 35.822197 | -117.485504 |
| 3 | 52.011734 | 0.745384 |
| 4 | 18.357292 | 55.317184 |
| 5 | 35.205872 | -97.975281 |
| 6 | 41.386227 | -74.81871 |
| 7 | 40.925266 | -86.970459 |
| 8 | 21.651302 | -157.831375 |
| 9 | 48.089485 | -122.424545 |
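With the centroids in hand, each sighting can be matched to its nearest proposed location. The sketch below does this for the five rows shown by `df_geo.head()`, using the ten centroids from the table above and plain Euclidean distance in degrees (the same simplification the clustering itself makes). The variable names are illustrative.

```python
import numpy as np

# The ten centroids from the table above (latitude, longitude).
centroids = np.array([
    [30.821253, -81.567757],
    [-16.245201, 135.893524],
    [35.822197, -117.485504],
    [52.011734, 0.745384],
    [18.357292, 55.317184],
    [35.205872, -97.975281],
    [41.386227, -74.81871],
    [40.925266, -86.970459],
    [21.651302, -157.831375],
    [48.089485, -122.424545],
])

# The first five sightings from df_geo.head().
sightings = np.array([
    [47.329444, -122.578889],
    [52.664913, -1.034894],
    [38.951667, -92.333889],
    [41.496944, -71.367778],
    [47.606389, -122.330833],
])

# Pairwise distances with shape (n_sightings, n_centroids), then the nearest index.
distances = np.linalg.norm(sightings[:, None, :] - centroids[None, :, :], axis=2)
nearest = distances.argmin(axis=1)

# How many of these sightings fall to each centroid.
counts = np.bincount(nearest, minlength=len(centroids))
print(nearest)
print(counts)
```

Running the same assignment over all 18,000 rows would show how many sightings each proposed facility covers.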


In [24]:
#write the cluster centroids to a CSV buffer and upload it to S3
from io import StringIO

csv_buffer = StringIO()
cluster_centroids_kmeans.to_csv(csv_buffer, index=False)
s3_resource = boto3.resource('s3')
s3_resource.Object(bucket, 'results/ten_locations_kmeans.csv').put(Body=csv_buffer.getvalue())

{'ResponseMetadata': {'RequestId': '1226BD50A08F99ED',
  'HostId': 'ZW4wC98dDfnVujogVohndiiXA7vPHrq8OOBg4OhfRb0R0F0ajxFxEE3mHY4GYCuJ0TRHJ2FnzPE=',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amz-id-2': 'ZW4wC98dDfnVujogVohndiiXA7vPHrq8OOBg4OhfRb0R0F0ajxFxEE3mHY4GYCuJ0TRHJ2FnzPE=',
   'x-amz-request-id': '1226BD50A08F99ED',
   'date': 'Sun, 05 Apr 2020 09:14:56 GMT',
   'etag': '"c785c66f117b920c650ef67b8aca65e4"',
   'content-length': '0',
   'server': 'AmazonS3'},
  'RetryAttempts': 0},
 'ETag': '"c785c66f117b920c650ef67b8aca65e4"'}
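The `put` response only confirms that S3 accepted the object. Before uploading, it can be worth checking that the CSV text itself looks right. The sketch below writes two of the centroids from the earlier table to a `StringIO` buffer, as the cell above does, and reads the text back with pandas; it needs no S3 access, and the variable names are illustrative.

```python
from io import StringIO

import pandas as pd

# First two centroids from the results table, used as sample data.
centroids = pd.DataFrame({
    "latitude": [30.821253, -16.245201],
    "longitude": [-81.567757, 135.893524],
})

# Serialize without the index, as in the upload cell.
buffer = StringIO()
centroids.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Parse the text back to confirm the header and the values survive.
restored = pd.read_csv(StringIO(csv_text))
print(csv_text)
```

Passing `index=False` keeps the file to the two coordinate columns, which keeps the import into QuickSight simple.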