# Final Project for IBM Data Science Program

## My project: Find My Home 


### Introduction Section

For my capstone project, I will be looking at the city of Sydney, Australia. The user I have in mind is someone deciding which neighborhood to move to. My goal is to find the neighborhoods that best match that user. To accomplish this, I will develop a profile for each neighborhood and match it to the user's lifestyle preferences (expressed as the 5 venue categories most important to them).
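
The matching idea can be sketched in plain Python before any real data is collected. This is only an illustration under stated assumptions: the neighborhood names, the category frequencies, and the rank-based weighting (first choice weighted highest) are all made up for the example and are not the final method.

```python
# Hypothetical venue-category frequencies per neighborhood (made-up numbers).
profiles = {
    "Neighborhood A": {"Cafe": 0.30, "Park": 0.10, "Gym": 0.05, "Bar": 0.20},
    "Neighborhood B": {"Cafe": 0.10, "Park": 0.35, "Gym": 0.15, "Bar": 0.02},
    "Neighborhood C": {"Cafe": 0.05, "Park": 0.05, "Gym": 0.02, "Bar": 0.40},
}

# Hypothetical user: categories ordered from most to least important.
top_categories = ["Park", "Gym", "Cafe"]


def score_neighborhoods(profiles, top_categories):
    """Rank neighborhoods by a weighted sum over the user's preferred categories."""
    n = len(top_categories)
    # Assumed weighting: the first choice gets weight n, the last gets 1.
    weights = {cat: n - i for i, cat in enumerate(top_categories)}
    scores = {
        name: sum(w * freqs.get(cat, 0.0) for cat, w in weights.items())
        for name, freqs in profiles.items()
    }
    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = score_neighborhoods(profiles, top_categories)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

With real data, the profiles would come from the venue categories found in each neighborhood rather than from hand-written numbers.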

### Data Section

As was done in the previous modules, a data table with the names and coordinates of neighborhoods in Sydney is needed. I will start with an HTML table of New South Wales postal codes from GeoNames. Then, I will use Foursquare to get venue data for each area. This will be enough to cluster similar areas.

A content-based recommendation system will then be developed using venue categories as the features of the neighborhood. No additional data will be needed. I will use a fake user profile to showcase the system's output given their top 5 venue categories. Below is my first step of getting the location data.

In [3]:
import numpy as np 

import pandas as pd 
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json 

!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim 

import requests 
from pandas.io.json import json_normalize 

import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

In [4]:
!conda install -c conda-forge beautifulsoup4 --yes
from bs4 import BeautifulSoup


#### Now we are ready to extract the table from the HTML page

In [66]:
geo = requests.get('https://www.geonames.org/postal-codes/AU/NSW/new-south-wales.html')
geo.status_code

200

In [67]:
soup = BeautifulSoup(geo.content,'html.parser')

In [68]:
table = soup.find('table',class_='restable') #The Source Code in the HTML points to this as the table.

In [69]:
#Here we collect every data cell (td) in the table as one flat list. The tags are stripped in a later step.
table_rows = table.find_all('td')

def slice_per(source, step):
    return [source[i::step] for i in range(step)]

data = slice_per(table_rows,9) #Due to complexity of table structure, this function will separate out every 9th item for each soon-to-be column

The flat list of td cells repeats every 9 items, and each group of 9 holds all of the information for one row. (0 is #, 1 is Place, 2 is Code, 3 is Country, 4 is State, 5 is Area, 6 is what broke the soup extraction, 7 is nil, 8 is latlong.)

Now, each list in the 'data' list needs to be formatted before being added to a column in a dataframe. We only need Place, Code, Area, and latlong. As part of this, the latlong list will need to be split into latitude and longitude.
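
To make the 9-cell layout concrete, here is a small sketch of the same striding trick on a toy list. The placeholder strings stand in for the real td cells, and selecting columns by index through the `KEEP` mapping is just one possible alternative to deleting the unwanted lists; `KEEP` is a name introduced for this illustration.

```python
def slice_per(source, step):
    # Same helper as in the notebook: list i of the result holds every
    # step-th element of source, starting at offset i.
    return [source[i::step] for i in range(step)]


# Toy stand-in for the flat list of <td> cells: 2 rows of 9 cells each.
# These are placeholder strings, not real page content.
cells = [f"r{row}c{col}" for row in range(2) for col in range(9)]

columns = slice_per(cells, 9)

# Keep Place (1), Code (2), Area (5) and latlong (8) by their offsets.
KEEP = {"Place": 1, "Code": 2, "Area": 5, "Latlong": 8}
selected = {name: columns[idx] for name, idx in KEEP.items()}

print(selected["Place"])    # every cell at offset 1 within its row
print(selected["Latlong"])  # every cell at offset 8 within its row
```

Selecting by index avoids having to delete in descending order, which is otherwise required so that earlier deletions do not shift the later indices.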

In [70]:
#Let's remove the items that we don't need
del data[7]
del data[6]
del data[4]
del data[3]
del data[0]

In [71]:
#The main problem with most of them is the remaining HTML tags. Let's remove them.
for sublist in data:
    for i, s in enumerate(sublist):
        sublist[i] = str(s).replace('<td>','').replace('</td>','')

In [72]:
#The Latlong item has a lot more in it than just td tags. We need a different approach to get to the coordinates only.
import re

for i in range(len(data[3])):
    s = data[3][i]
    result = re.search('<small>(.*)</small>', s)
    data[3][i] = result.group(1)
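
The loop above assumes every cell contains a `<small>` tag; if one does not, `re.search` returns `None` and `.group(1)` raises an error. A more defensive variant might look like the sketch below. The helper name `extract_latlong` and the sample markup are assumptions made for illustration, and the real page's markup may differ.

```python
import re

# Non-greedy pattern for the text inside the <small> tag.
SMALL_RE = re.compile(r"<small>(.*?)</small>")


def extract_latlong(cell_html):
    """Return (latitude, longitude) as floats, or None if no <small> tag is found."""
    match = SMALL_RE.search(cell_html)
    if match is None:
        return None
    # The coordinates are written as "latitude/longitude".
    lat_text, lon_text = match.group(1).split("/")
    return float(lat_text), float(lon_text)


# Hypothetical cell markup, used only to exercise the helper.
sample = '<td colspan="6"><small>-33.88/151.205</small></td>'
print(extract_latlong(sample))
print(extract_latlong("<td></td>"))
```

Returning floats here also removes the need to convert the coordinates later, since splitting a string column leaves the values as text.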
    

In [74]:
df = pd.DataFrame({'Place':data[0],'Code':data[1],'Area':data[2],'Latlong':data[3]}) #Establish our dataframe

In [79]:
df[['Latitude','Longitude']] = df['Latlong'].str.split('/',expand=True) #Split Latitude and Longitude into separate columns

In [83]:
df.drop('Latlong',axis=1,inplace=True) #Finally remove Latlong column
df.head() #Complete!

|   | Place | Code | Area | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | Haymarket | 2000 | SYDNEY STREETS | -33.88 | 151.205 |
| 1 | Ultimo | 2007 | SYDNEY STREETS | -33.881 | 151.198 |
| 2 | Chippendale | 2008 | SYDNEY STREETS | -33.886 | 151.199 |
| 3 | Pyrmont | 2009 | SYDNEY STREETS | -33.87 | 151.194 |
| 4 | Surry Hills | 2010 | SYDNEY STREETS | -33.885 | 151.212 |
