In [212]:
%%markdown
# The Latte Line of Sydney
## 1. Problem statement
### What is the Latte Line
Major cities around the world show great diversity between neighbourhoods in lifestyle, income, social status, ethnic background, population density, and so on.
In developed countries neighbourhoods are usually mixed on these criteria: ideally there are no ghettos and no exclusive upper-class areas. Even so, the differences between districts can be quite apparent.
In the case of Sydney, people talk about a 'latte line' that divides the city into two parts: the prosperous North-East and the less attractive South-West.
As the name 'latte line' suggests, this distinction is based on an observed or assumed difference in life style, although it could have been established on hard statistics such as property prices, average income, or the university admission rate of high school students.

Can we support or debunk the 'latte line' theory with the help of data science?


In [306]:
%%markdown
## 2. Use of data to evaluate the 'latte line'
The following data sets come in handy:
1. A list of neighbourhoods of Sydney
2. Population size of the neighbourhoods
3. Venues from the Foursquare API

For practical reasons the list of neighbourhoods is taken from the Australian Bureau of Statistics, since we can get the population size (number of residents) at the same granularity.
Each neighbourhood is assigned its lat/long coordinates.
Each neighbourhood is labeled according to which side of the latte line it is on. For the geographical identification of the latte line we use a report by the Australian national broadcaster, ABC (abc.net.au):
https://www.abc.net.au/news/2019-12-17/sydneys-latte-line-divides-job-and-housing-opportunities/11803706

The Foursquare API is used to fetch venues for the analysed geographical areas.
Since life style is the subject of the analysis, the number of venues is normalised by population density: venue count * 1000 / (residents / sq km).

The goal of the analysis is to build a model that predicts, with at least 75% accuracy, whether a neighbourhood belongs to the upper or lower side of the latte line, based on the life style represented by its venues.

Setting the accuracy threshold is arbitrary. We cannot expect Foursquare data to make a definite distinction between latte and non-latte regions, but if the accuracy is closer to 50% than to 100%, we may question whether the latte line reflects a difference in life style.
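
To make the density normalisation concrete, here is a minimal sketch of the formula as a helper function. The function name `normalised_venue_count` and the figures in the example are made up for illustration; they are not part of the analysis.

```python
def normalised_venue_count(venue_count, residents, area_sqkm):
    """Venue count scaled by population density (residents per sq km)."""
    density = residents / area_sqkm
    return venue_count * 1000 / density

# Hypothetical suburb: 5 cafés, 2000 residents, 4 sq km -> density of 500 residents per sq km
example = normalised_venue_count(5, 2000, 4)
print(example)  # 10.0
```

With this scaling, the same raw venue count weighs more in a sparsely populated suburb than in a dense one.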


In [214]:
%%markdown
## 3. Defining the latte line
Using the link in the previous section we can identify the suggested location of the theoretical straight line.
Note: a map is a 2D projection of a 3D planet, so treating lat/long coordinates as points on a plane and fitting a straight line through them is methodologically not entirely correct. However, considering the arbitrariness of the latte line, this error does not affect the labeling.
    
SE end: -34.001321, 151.235472
    
NW end: -33.627029, 150.662457
    
Substituting both points into y = ax + b, with latitude as y and longitude as x, gives the values of a and b:

-34.001321 = a * 151.235472 + b

-33.627029 = a * 150.662457 + b

Subtracting the second equation from the first eliminates b:

-34.001321 - (-33.627029) = a * (151.235472 - 150.662457)



In [215]:
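import math

# slope: difference in latitude over difference in longitude
left = -34.001321 - (-33.627029)
right = 151.235472 - 150.662457
a = left/right
print("Value of a:")
print(a)
# the intercept computed from either point should agree
b1 = -34.001321 - (a * 151.235472)
b2 = -33.627029 - (a * 150.662457)
if not math.isclose(b1, b2):
    print("Calculation error")
print("Value of b:")
print(b1)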

Value of a:
-0.6531975602732882
Value of b:
64.7853203371792


In [216]:
%%markdown
Now we can evaluate the neighbourhoods: if lat >= a * long + b, the suburb is on or above the latte line; otherwise it is below.

Checking with Bankstown and Castle Hill, using approximate lat/long values:

Bankstown: -33.916168, 151.033490 (should be below the line, thus False)

Castle Hill: -33.729117, 151.005955 (should be above the line, thus True)



In [217]:
def getLatte(lat, long):
    # slope and intercept of the latte line computed above
    a = -0.6531975602732882
    b = 64.7853203371792
    # True if the point is on or above the line
    return lat >= a * long + b

In [218]:
Bankstown = getLatte(-33.916168, 151.033490)
Bankstown

False

In [219]:
Castle_Hill = getLatte(-33.729117, 151.005955)
Castle_Hill

True

In [220]:
%%markdown
Our geographical latte labeling function is verified.

Next: neighbourhood and population data from the Australian Bureau of Statistics, 2016 census (abs.gov.au)
1. Population data of New South Wales on suburb level
2. Suburb dictionary

These are downloaded from the ABS data centre and imported to the platform manually.

The list of suburbs is from an Excel file downloaded from the NSW government's website.



In [221]:

import types
import pandas as pd
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share the notebook.
client_5063675a2a854501868f895828b10fa6 = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='YOUR_IBM_API_KEY',
    ibm_auth_endpoint="https://iam.ng.bluemix.net/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3-api.us-geo.objectstorage.service.networklayer.com')

body = client_5063675a2a854501868f895828b10fa6.get_object(Bucket='cloudera8-donotdelete-pr-650f7celdtersp',Key='census-population.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df_data_1 = pd.read_csv(body)
df_data_1.head()


Unnamed: 0,SSC_CODE_2016,Tot_P_M,Tot_P_F,Tot_P_P,Age_0_4_yr_M,Age_0_4_yr_F,Age_0_4_yr_P,Age_5_14_yr_M,Age_5_14_yr_F,Age_5_14_yr_P,...,High_yr_schl_comp_Yr_8_belw_P,High_yr_schl_comp_D_n_g_sch_M,High_yr_schl_comp_D_n_g_sch_F,High_yr_schl_comp_D_n_g_sch_P,Count_psns_occ_priv_dwgs_M,Count_psns_occ_priv_dwgs_F,Count_psns_occ_priv_dwgs_P,Count_Persons_other_dwgs_M,Count_Persons_other_dwgs_F,Count_Persons_other_dwgs_P
0,SSC10001,12,12,22,0,0,0,0,0,0,...,0,0,0,0,9,6,21,0,3,4
1,SSC10002,2076,2177,4253,108,109,214,258,277,537,...,195,28,43,69,2034,2141,4175,44,37,79
2,SSC10003,2542,2838,5373,156,149,306,268,261,529,...,217,10,23,36,2338,2625,4962,164,184,355
3,SSC10004,535,579,1109,30,26,58,95,123,222,...,26,0,0,0,506,554,1056,15,17,38
4,SSC10005,13,7,22,0,0,0,0,0,0,...,0,0,0,0,13,9,24,0,0,0


In [222]:

body = client_5063675a2a854501868f895828b10fa6.get_object(Bucket='cloudera8-donotdelete-pr-650f7celdtersp',Key='ssc_codes.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df_data_2 = pd.read_csv(body)
df_data_2.head()


Unnamed: 0,ASGS_Structure,Census_Code_2016,ASGS_Code_2016,Census_Name_2016,Area sqkm
0,AUS,036,36,AUSTRALIA,7688126.0
1,CED,CED101,101,Banks,49.446
2,CED,CED102,102,Barton,39.6466
3,CED,CED103,103,Bennelong,58.6052
4,CED,CED104,104,Berowra,749.6359


In [223]:
%%markdown
Let's get rid of the data we do not need for the analysis



In [224]:
# Note: Tot_P_M counts males only; Tot_P_P is the column with the total number of persons
df_pop = df_data_1[['SSC_CODE_2016','Tot_P_M']]
df_pop.head()

Unnamed: 0,SSC_CODE_2016,Tot_P_M
0,SSC10001,12
1,SSC10002,2076
2,SSC10003,2542
3,SSC10004,535
4,SSC10005,13


In [225]:
df_sub_dict = df_data_2[['Census_Code_2016', 'Census_Name_2016',  'Area sqkm']]
df_sub_dict.head()

Unnamed: 0,Census_Code_2016,Census_Name_2016,Area sqkm
0,036,AUSTRALIA,7688126.0
1,CED101,Banks,49.446
2,CED102,Barton,39.6466
3,CED103,Bennelong,58.6052
4,CED104,Berowra,749.6359


In [226]:
%%markdown
Let's assign suburb names from the dictionary to the population dataframe. Every code should have an entry in the dictionary, so an inner join is appropriate. Note that even if some suburbs get lost, the analysis will still be valid.



In [227]:
df_suburbs = df_pop.join(df_sub_dict.set_index('Census_Code_2016'), on='SSC_CODE_2016', how='inner')
df_suburbs.head()

Unnamed: 0,SSC_CODE_2016,Tot_P_M,Census_Name_2016,Area sqkm
0,SSC10001,12,Aarons Pass,82.7639
1,SSC10002,2076,Abbotsbury,4.9788
2,SSC10003,2542,Abbotsford (NSW),1.018
3,SSC10004,535,Abercrombie,2.9775
4,SSC10005,13,Abercrombie River,127.1701
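
To see how many codes an inner join would silently drop, one option is a left merge with `indicator=True`. The sketch below uses a small hand-made frame rather than the census files, and the code `SSC99999` is hypothetical.

```python
import pandas as pd

# Toy stand-ins for the population table and the suburb dictionary
pop = pd.DataFrame({
    "SSC_CODE_2016": ["SSC10001", "SSC10002", "SSC99999"],  # SSC99999 is made up
    "Tot_P_M": [12, 2076, 5],
})
names = pd.DataFrame({
    "Census_Code_2016": ["SSC10001", "SSC10002"],
    "Census_Name_2016": ["Aarons Pass", "Abbotsbury"],
})

# A left merge keeps every population row; _merge tells us which ones found a name
checked = pop.merge(names, left_on="SSC_CODE_2016", right_on="Census_Code_2016",
                    how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]
print(len(unmatched), "code(s) without a dictionary entry")
```

If the count is zero, the inner join loses nothing; otherwise the unmatched rows show exactly which codes are missing.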


In [228]:
%%markdown
Let's remove (NSW)



In [229]:
for i in range(0, len(df_suburbs)):
    nsw_index = df_suburbs.iloc[i][2].find('(NSW)')
    if nsw_index > -1:
        df_suburbs.loc[i, 'Census_Name_2016'] = (df_suburbs.iloc[i][2])[0:nsw_index]
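
Slicing the name at the position of '(NSW)' keeps the space in front of the parenthesis, so 'Abbotsford (NSW)' becomes 'Abbotsford ' with a trailing space, which will not match 'Abbotsford' in a later join. A vectorised alternative that also drops that space could look like the sketch below. The sample names are taken from the tables in this notebook; the regular expression is an assumption about how the suffix is written.

```python
import pandas as pd

names = pd.Series(["Abbotsford (NSW)", "Abbotsbury", "Alison (Dungog - NSW)"])

# Remove a trailing "(NSW)" together with the whitespace before it;
# region remarks such as "(Dungog - NSW)" are left untouched
cleaned = names.str.replace(r"\s*\(NSW\)$", "", regex=True)
print(cleaned.tolist())
# ['Abbotsford', 'Abbotsbury', 'Alison (Dungog - NSW)']
```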
       

In [230]:
%%markdown
Let's check further parentheses



In [231]:
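# regex=False searches for a literal "(" and avoids the invalid "\(" escape in a plain string
df_suburbs[df_suburbs['Census_Name_2016'].str.contains("(", regex=False)]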

Unnamed: 0,SSC_CODE_2016,Tot_P_M,Census_Name_2016,Area sqkm
33,SSC10034,44,Alison (Central Coast - NSW),1.3958
34,SSC10035,43,Alison (Dungog - NSW),19.2707
122,SSC10123,28,Back Creek (Bland - NSW),449.7950
123,SSC10124,9,Back Creek (Gwydir - NSW),119.9514
124,SSC10125,13,Back Creek (Mid-Coast - NSW),31.5289
125,SSC10126,3,Back Creek (Queanbeyan-Palerang Regional - NSW),39.2653
126,SSC10127,8,Back Creek (Tenterfield - NSW),200.7410
127,SSC10128,9,Back Creek (Tweed - NSW),7.4364
138,SSC10139,22,Bakers Creek (Mid-Coast - NSW),40.4685
139,SSC10140,7,Bakers Creek (Nambucca - NSW),57.2165


In [232]:
%%markdown
There are still some parenthesised values; these distinguish between NSW regions that share a suburb name.
Considerations: if we remove these remarks, we may create false matches to Sydney suburbs.
If we leave them, we may lose some suburbs. False matches would be worse, since they can distort the observed population.
Thus let's leave them as they are.

Next, let's fetch the suburb list.



In [233]:

body = client_5063675a2a854501868f895828b10fa6.get_object(Bucket='cloudera8-donotdelete-pr-650f7celdtersp',Key='suburb-list.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df_data_3 = pd.read_csv(body)
df_data_3.head()


Unnamed: 0,Suburb
0,Abbotsbury
1,Abbotsford
2,Acacia Gardens
3,Agnes Banks
4,Airds


In [234]:
%%markdown
Let's merge with the statistical bureau data, and check the size again



In [235]:
df_sydney_suburbs = df_data_3.join(df_suburbs.set_index('Census_Name_2016'), on='Suburb', how='inner')
df_sydney_suburbs.head()


Unnamed: 0,Suburb,SSC_CODE_2016,Tot_P_M,Area sqkm
0,Abbotsbury,SSC10002,2076,4.9788
2,Acacia Gardens,SSC10014,1898,1.0013
3,Agnes Banks,SSC10021,472,15.475
4,Airds,SSC10022,1333,2.3808
5,Alexandria,SSC10030,4214,3.5156


In [236]:
df_sydney_suburbs = df_sydney_suburbs.dropna()
df_sydney_suburbs.reset_index(drop=True)

Unnamed: 0,Suburb,SSC_CODE_2016,Tot_P_M,Area sqkm
0,Abbotsbury,SSC10002,2076,4.9788
1,Acacia Gardens,SSC10014,1898,1.0013
2,Agnes Banks,SSC10021,472,15.4750
3,Airds,SSC10022,1333,2.3808
4,Alexandria,SSC10030,4214,3.5156
5,Alfords Point,SSC10031,1524,2.5665
6,Allambie Heights,SSC10036,3404,6.6746
7,Allawah,SSC10038,2857,0.5837
8,Ambarvale,SSC10049,3598,2.8822
9,Annangrove,SSC10062,724,10.2464


In [237]:
%%markdown
Setting up the environment for the next steps



In [238]:
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation

!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 
    
# library for transforming a JSON file into a pandas dataframe
from pandas.io.json import json_normalize

!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library

print('Folium installed')


# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans
import json # library to handle JSON files
print('Libraries imported.')


Solving environment: done

# All requested packages already installed.

Solving environment: done

# All requested packages already installed.

Folium installed
Libraries imported.


In [239]:
%%markdown
Let's define a geolocator function



In [240]:
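def geo_suburb(address, max_retries=3):
    """Return (latitude, longitude) for an address, or (0, 0) if geocoding fails."""
    geolocator = Nominatim(user_agent="new_app")
    for _ in range(max_retries):
        try:
            location = geolocator.geocode(address)
        except Exception:
            # transient network or service error: try again
            continue
        if location is None:
            # the service answered but found no match
            return 0, 0
        return location.latitude, location.longitude
    # (0, 0) signals "not found" to the caller
    return 0, 0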
       


In [241]:
%%markdown
Let's add latitude and longitude columns to the dataframe, and check



In [242]:
df_sydney_suburbs['lat'] = "0"
df_sydney_suburbs['long'] = "0"
df_sydney_suburbs.head()                  

Unnamed: 0,Suburb,SSC_CODE_2016,Tot_P_M,Area sqkm,lat,long
0,Abbotsbury,SSC10002,2076,4.9788,0,0
2,Acacia Gardens,SSC10014,1898,1.0013,0,0
3,Agnes Banks,SSC10021,472,15.475,0,0
4,Airds,SSC10022,1333,2.3808,0,0
5,Alexandria,SSC10030,4214,3.5156,0,0


In [243]:
df_sydney_suburbs.tail()  

Unnamed: 0,Suburb,SSC_CODE_2016,Tot_P_M,Area sqkm,lat,long
664,Yagoona,SSC14453,8999,4.618,0,0
665,Yarrawarrah,SSC14485,1334,1.2438,0,0
666,Yennora,SSC14506,804,2.7387,0,0
667,Yowie Bay,SSC14519,1497,1.2045,0,0
668,Zetland,SSC14524,5027,0.8048,0,0


In [244]:
#del df_sydney_suburbs['latte']
#df_sydney_suburbs = df_sydney_suburbs.dropna()
#df_sydney_suburbs.reset_index(drop=True)

In [246]:
for i in range(len(df_sydney_suburbs)):
    suburb = df_sydney_suburbs.iat[i, 0]
    print(suburb)
    # column 4 is 'lat', column 5 is 'long'; "0" marks a suburb not geocoded yet
    if float(df_sydney_suburbs.iat[i, 4]) == 0:
        print("in update")
        longitude = 0
        attempt = 0
        while longitude == 0:
            if attempt > 0 and attempt % 50 == 0:
                print("50 repeat for " + suburb)
            latitude, longitude = geo_suburb(suburb + ", Australia")
            df_sydney_suburbs.iat[i, 4] = latitude
            df_sydney_suburbs.iat[i, 5] = longitude
            attempt += 1
            # give up after 100 attempts
            if attempt == 100:
                break
print("Loop ended")

Abbotsbury
in update
Acacia Gardens
in update
Agnes Banks
in update
Airds
in update
Alexandria
in update
Alfords Point
in update
Allambie Heights
in update
Allawah
in update
Ambarvale
in update
Annangrove
in update
Arncliffe
in update
Arndell Park
in update
Artarmon
in update
Ashbury
in update
Ashcroft
in update
Asquith
in update
Austinmer
in update
Austral
in update
Avalon Beach
in update
Badgerys Creek
in update
Balgowlah
in update
Balgowlah Heights
in update
Balmain
in update
Balmain East
in update
Banksia
in update
Banksmeadow
in update
Bankstown
in update
Bankstown Aerodrome
in update
Barangaroo
in update
Barden Ridge
in update
Bardia
in update
Bardwell Park
in update
Bardwell Valley
in update
Bass Hill
in update
Baulkham Hills
in update
Beacon Hill
in update
Beaumont Hills
in update
Beecroft
in update
Belfield
in update
Bella Vista
in update
Bellevue Hill
in update
Belmore
in update
Belrose
in update
Berala
in update
Berkshire Park
in update
Berowra
in update
Berowra Creek
in upd

In [247]:
#df_sydney_suburbs = df_sydney_suburbs.dropna()
#df_sydney_suburbs.reset_index(drop=True)
df_sydney_suburbs.head()

Unnamed: 0,Suburb,SSC_CODE_2016,Tot_P_M,Area sqkm,lat,long
0,Abbotsbury,SSC10002,2076,4.9788,-33.8693,150.867
2,Acacia Gardens,SSC10014,1898,1.0013,-33.7325,150.913
3,Agnes Banks,SSC10021,472,15.475,-33.6145,150.711
4,Airds,SSC10022,1333,2.3808,-34.09,150.826
5,Alexandria,SSC10030,4214,3.5156,-33.9092,151.192


In [248]:
df_sydney_suburbs.tail()

Unnamed: 0,Suburb,SSC_CODE_2016,Tot_P_M,Area sqkm,lat,long
664,Yagoona,SSC14453,8999,4.618,-33.9038,151.018
665,Yarrawarrah,SSC14485,1334,1.2438,-34.0567,151.031
666,Yennora,SSC14506,804,2.7387,-33.862,150.969
667,Yowie Bay,SSC14519,1497,1.2045,-34.0503,151.103
668,Zetland,SSC14524,5027,0.8048,-33.9077,151.208


In [249]:
df_sydney_suburbs = df_sydney_suburbs.assign(latte = "")


In [250]:
df_sydney_suburbs.head()

Unnamed: 0,Suburb,SSC_CODE_2016,Tot_P_M,Area sqkm,lat,long,latte
0,Abbotsbury,SSC10002,2076,4.9788,-33.8693,150.867,
2,Acacia Gardens,SSC10014,1898,1.0013,-33.7325,150.913,
3,Agnes Banks,SSC10021,472,15.475,-33.6145,150.711,
4,Airds,SSC10022,1333,2.3808,-34.09,150.826,
5,Alexandria,SSC10030,4214,3.5156,-33.9092,151.192,


In [251]:
df_sydney_suburbs.tail()

Unnamed: 0,Suburb,SSC_CODE_2016,Tot_P_M,Area sqkm,lat,long,latte
664,Yagoona,SSC14453,8999,4.618,-33.9038,151.018,
665,Yarrawarrah,SSC14485,1334,1.2438,-34.0567,151.031,
666,Yennora,SSC14506,804,2.7387,-33.862,150.969,
667,Yowie Bay,SSC14519,1497,1.2045,-34.0503,151.103,
668,Zetland,SSC14524,5027,0.8048,-33.9077,151.208,


In [252]:
df_sydney_suburbs = df_sydney_suburbs.dropna()
df_sydney_suburbs.reset_index(drop=True)

Unnamed: 0,Suburb,SSC_CODE_2016,Tot_P_M,Area sqkm,lat,long,latte
0,Abbotsbury,SSC10002,2076,4.9788,-33.8693,150.867,
1,Acacia Gardens,SSC10014,1898,1.0013,-33.7325,150.913,
2,Agnes Banks,SSC10021,472,15.4750,-33.6145,150.711,
3,Airds,SSC10022,1333,2.3808,-34.09,150.826,
4,Alexandria,SSC10030,4214,3.5156,-33.9092,151.192,
5,Alfords Point,SSC10031,1524,2.5665,-33.9839,151.024,
6,Allambie Heights,SSC10036,3404,6.6746,-33.7705,151.25,
7,Allawah,SSC10038,2857,0.5837,-31.9288,149.513,
8,Ambarvale,SSC10049,3598,2.8822,-34.0844,150.802,
9,Annangrove,SSC10062,724,10.2464,-33.6575,150.946,
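
Some geocoding results deserve a sanity check: Allawah, for example, comes back as (-31.9288, 149.513), which is far from the other suburbs in the table. A rough bounding-box filter is one way to flag such rows. The box limits below are an assumption about the extent of greater Sydney and should be adjusted as needed; the two sample rows are copied from the table above.

```python
import pandas as pd

# Rough bounding box for greater Sydney (assumed values, not from the source data)
LAT_MIN, LAT_MAX = -34.4, -33.3
LONG_MIN, LONG_MAX = 150.4, 151.5

sample = pd.DataFrame({
    "Suburb": ["Alexandria", "Allawah"],
    "lat": [-33.9092, -31.9288],
    "long": [151.192, 149.513],
})

# Flag rows whose coordinates fall outside the box
in_box = sample["lat"].between(LAT_MIN, LAT_MAX) & sample["long"].between(LONG_MIN, LONG_MAX)
suspect = sample[~in_box]
print(suspect["Suburb"].tolist())
# ['Allawah']
```

Flagged suburbs could then be geocoded again with a more specific query, or dropped.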


In [253]:
%%markdown
Now using the latte function let's assign the label to every record



In [254]:
for i in range(0, len(df_sydney_suburbs)):
    print(df_sydney_suburbs.iat[i, 0])
    df_sydney_suburbs.iat[i, 6] = getLatte(float(df_sydney_suburbs.iat[i, 4]), float(df_sydney_suburbs.iat[i, 5]))
   

Abbotsbury
Acacia Gardens
Agnes Banks
Airds
Alexandria
Alfords Point
Allambie Heights
Allawah
Ambarvale
Annangrove
Arncliffe
Arndell Park
Artarmon
Ashbury
Ashcroft
Asquith
Austinmer
Austral
Avalon Beach
Badgerys Creek
Balgowlah
Balgowlah Heights
Balmain
Balmain East
Banksia
Banksmeadow
Bankstown
Bankstown Aerodrome
Barangaroo
Barden Ridge
Bardia
Bardwell Park
Bardwell Valley
Bass Hill
Baulkham Hills
Beacon Hill
Beaumont Hills
Beecroft
Belfield
Bella Vista
Bellevue Hill
Belmore
Belrose
Berala
Berkshire Park
Berowra
Berowra Creek
Berowra Heights
Berowra Waters
Berrilee
Beverley Park
Beverly Hills
Bexley
Bexley North
Bilgola Beach
Bilgola Plateau
Birchgrove
Birrong
Blackett
Blacktown
Blairmount
Blakehurst
Bligh Park
Bondi
Bondi Beach
Bondi Junction
Bonnet Bay
Bonnyrigg
Bonnyrigg Heights
Bossley Park
Botany
Bow Bowing
Breakfast Point
Bringelly
Bronte
Brookvale
Bundeena
Bungarribee
Burraneer
Burwood Heights
Busby
Cabramatta
Cabramatta West
Caddens
Cambridge Gardens
Cambridge Park
Camellia
C
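
The same labeling can be written as a single vectorised comparison instead of a row-by-row loop. This is a sketch on the two test suburbs used earlier, with the slope and intercept copied from the calculation above; the names `A`, `B` and `sample` are only for illustration.

```python
import pandas as pd

# Slope and intercept of the latte line, as computed earlier in the notebook
A = -0.6531975602732882
B = 64.7853203371792

sample = pd.DataFrame({
    "Suburb": ["Bankstown", "Castle Hill"],
    "lat": [-33.916168, -33.729117],
    "long": [151.033490, 151.005955],
})

# True where the suburb is on or above the line
sample["latte"] = sample["lat"] >= A * sample["long"] + B
print(sample[["Suburb", "latte"]])
```

Bankstown comes out as False and Castle Hill as True, in line with the checks of `getLatte` above.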

In [255]:
df_sydney_suburbs

Unnamed: 0,Suburb,SSC_CODE_2016,Tot_P_M,Area sqkm,lat,long,latte
0,Abbotsbury,SSC10002,2076,4.9788,-33.8693,150.867,False
2,Acacia Gardens,SSC10014,1898,1.0013,-33.7325,150.913,True
3,Agnes Banks,SSC10021,472,15.4750,-33.6145,150.711,True
4,Airds,SSC10022,1333,2.3808,-34.09,150.826,False
5,Alexandria,SSC10030,4214,3.5156,-33.9092,151.192,True
6,Alfords Point,SSC10031,1524,2.5665,-33.9839,151.024,False
7,Allambie Heights,SSC10036,3404,6.6746,-33.7705,151.25,True
8,Allawah,SSC10038,2857,0.5837,-31.9288,149.513,True
9,Ambarvale,SSC10049,3598,2.8822,-34.0844,150.802,False
11,Annangrove,SSC10062,724,10.2464,-33.6575,150.946,True


In [53]:
#df_sydney_suburbs = df_sydney_suburbs.dropna()
df_sydney_suburbs.reset_index(drop=True)

Unnamed: 0,Suburb,SSC_CODE_2016,Tot_P_M,Area sqkm,lat,long,latte
0,Abbotsbury,SSC10002,2076,4.9788,-33.8693,150.867,False
1,Acacia Gardens,SSC10014,1898,1.0013,-33.7325,150.913,True
2,Agnes Banks,SSC10021,472,15.4750,-33.6145,150.711,True
3,Airds,SSC10022,1333,2.3808,-34.09,150.826,False
4,Alexandria,SSC10030,4214,3.5156,-33.9092,151.192,True
5,Alfords Point,SSC10031,1524,2.5665,-33.9839,151.024,False
6,Allambie Heights,SSC10036,3404,6.6746,-33.7705,151.25,True
7,Allawah,SSC10038,2857,0.5837,-31.9288,149.513,True
8,Ambarvale,SSC10049,3598,2.8822,-34.0844,150.802,False
9,Annangrove,SSC10062,724,10.2464,-33.6575,150.946,True


In [256]:
%%markdown
Calculating density for venue count normalisation



In [257]:

df_sydney_suburbs['density'] = df_sydney_suburbs['Tot_P_M']/df_sydney_suburbs['Area sqkm']

In [258]:
df_sydney_suburbs.head()

Unnamed: 0,Suburb,SSC_CODE_2016,Tot_P_M,Area sqkm,lat,long,latte,density
0,Abbotsbury,SSC10002,2076,4.9788,-33.8693,150.867,False,416.967944
2,Acacia Gardens,SSC10014,1898,1.0013,-33.7325,150.913,True,1895.535803
3,Agnes Banks,SSC10021,472,15.475,-33.6145,150.711,True,30.500808
4,Airds,SSC10022,1333,2.3808,-34.09,150.826,False,559.895833
5,Alexandria,SSC10030,4214,3.5156,-33.9092,151.192,True,1198.657413


In [259]:
%%markdown
Now that we have 530 suburbs with population, area size, lat/long and latte label, we can assign venue data from Foursquare.



In [260]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

In [261]:
import requests

In [262]:
def getNearbyVenues(names, latitudes, longitudes, radius=500, LIMIT = 500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [264]:
sydney_venues = getNearbyVenues(names=df_sydney_suburbs['Suburb'],
                                   latitudes=df_sydney_suburbs['lat'],
                                   longitudes=df_sydney_suburbs['long']
                                  )

Abbotsbury
Acacia Gardens
Agnes Banks
Airds
Alexandria
Alfords Point
Allambie Heights
Allawah
Ambarvale
Annangrove
Arncliffe
Arndell Park
Artarmon
Ashbury
Ashcroft
Asquith
Austinmer
Austral
Avalon Beach
Badgerys Creek
Balgowlah
Balgowlah Heights
Balmain
Balmain East
Banksia
Banksmeadow
Bankstown
Bankstown Aerodrome
Barangaroo
Barden Ridge
Bardia
Bardwell Park
Bardwell Valley
Bass Hill
Baulkham Hills
Beacon Hill
Beaumont Hills
Beecroft
Belfield
Bella Vista
Bellevue Hill
Belmore
Belrose
Berala
Berkshire Park
Berowra
Berowra Creek
Berowra Heights
Berowra Waters
Berrilee
Beverley Park
Beverly Hills
Bexley
Bexley North
Bilgola Beach
Bilgola Plateau
Birchgrove
Birrong
Blackett
Blacktown
Blairmount
Blakehurst
Bligh Park
Bondi
Bondi Beach
Bondi Junction
Bonnet Bay
Bonnyrigg
Bonnyrigg Heights
Bossley Park
Botany
Bow Bowing
Breakfast Point
Bringelly
Bronte
Brookvale
Bundeena
Bungarribee
Burraneer
Burwood Heights
Busby
Cabramatta
Cabramatta West
Caddens
Cambridge Gardens
Cambridge Park
Camellia
C

In [265]:
print(sydney_venues.shape)
sydney_venues.head()

(4361, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Abbotsbury,-33.869285,150.866703,Foodworks,-33.869191,150.865388,Supermarket
1,Abbotsbury,-33.869285,150.866703,Abbotsbury Shops,-33.869554,150.865294,Convenience Store
2,Abbotsbury,-33.869285,150.866703,Stockdale Reserve,-33.871098,150.86806,Park
3,Abbotsbury,-33.869285,150.866703,817 bus stop,-33.869773,150.870122,Bus Station
4,Alexandria,-33.909157,151.192128,The Grounds Of Alexandria,-33.910774,151.194406,Café


In [266]:
%%markdown
Grouping the venues based on their type, getting the count of specific venue types for each suburb



In [267]:
typecount = sydney_venues.groupby(['Neighborhood', 'Venue Category']).size()
type(typecount)
venue_count = pd.DataFrame(typecount).reset_index()
venue_count.head()

Unnamed: 0,Neighborhood,Venue Category,0
0,Abbotsbury,Bus Station,1
1,Abbotsbury,Convenience Store,1
2,Abbotsbury,Park,1
3,Abbotsbury,Supermarket,1
4,Alexandria,Australian Restaurant,1


In [268]:
venue_count.columns = ['Suburb', 'venue_cat', 'venue_count']
venue_count.head()

Unnamed: 0,Suburb,venue_cat,venue_count
0,Abbotsbury,Bus Station,1
1,Abbotsbury,Convenience Store,1
2,Abbotsbury,Park,1
3,Abbotsbury,Supermarket,1
4,Alexandria,Australian Restaurant,1


In [269]:
%%markdown
Let's pivot the dataframe so that each suburb is one row and each venue type is a column holding its count.
The initial stub is the first suburb's name; a simple loop then processes the original table to add the new rows and columns.



In [270]:
venue_tr = venue_count.head(1)[['Suburb']]
venue_tr

Unnamed: 0,Suburb
0,Abbotsbury


In [271]:
current_sub = venue_tr.iat[0, 0]
tr_index = 0
for i in range(0, len(venue_count)):
    # check if suburb is the same as last, if not, add new row and set the value of the suburb
    if (venue_count.iat[i, 0] != current_sub):
        venue_tr = venue_tr.append(pd.Series([np.nan]), ignore_index = True)
        tr_index += 1
        current_sub = venue_count.iat[i, 0]
        venue_tr.iat[tr_index, 0] = current_sub
    # check if the new matrix already has a column for the venue type, if not, add it
    if str(venue_count.iat[i, 1]) not in venue_tr:
        venue_tr[str(venue_count.iat[i, 1])] = "0"
        #print(str(venue_count.iat[i, 1]))
    # update count of venue type
    column_index = venue_tr.columns.get_loc(str(venue_count.iat[i, 1]))
    venue_tr.iat[tr_index, column_index] = venue_count.iat[i, 2]
del venue_tr[0]
venue_tr = venue_tr.fillna(0)
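
The same wide table can probably be produced without an explicit loop using `pivot_table`. This is a sketch on a small hand-made sample rather than the notebook's data, and `venue_wide` is an illustrative name. Note also that `DataFrame.append`, used in the loop above, was removed in pandas 2.0, so the loop needs an older pandas version.

```python
import pandas as pd

# Small sample in the same long format as venue_count
venue_count = pd.DataFrame({
    "Suburb": ["Abbotsbury", "Abbotsbury", "Alexandria", "Alexandria"],
    "venue_cat": ["Park", "Supermarket", "Café", "Bar"],
    "venue_count": [1, 1, 5, 1],
})

# One row per suburb, one column per venue category, 0 where a category is absent
venue_wide = venue_count.pivot_table(
    index="Suburb", columns="venue_cat", values="venue_count",
    aggfunc="sum", fill_value=0,
).reset_index()
venue_wide.columns.name = None
print(venue_wide)
```

Unlike the loop, the pivot orders the category columns alphabetically, which does not matter for the later join on `Suburb`.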

In [272]:
df_labeler = df_sydney_suburbs[['Suburb', 'latte', 'density']]
df_labeler.head()

Unnamed: 0,Suburb,latte,density
0,Abbotsbury,False,416.967944
2,Acacia Gardens,True,1895.535803
3,Agnes Banks,True,30.500808
4,Airds,False,559.895833
5,Alexandria,True,1198.657413


In [273]:
df_labeled_venue_set = venue_tr.join(df_labeler.set_index('Suburb'), on='Suburb', how='inner')
df_labeled_venue_set.head()

Unnamed: 0,Suburb,Bus Station,Convenience Store,Park,Supermarket,Australian Restaurant,Bar,Basketball Court,Café,Flea Market,...,Port,Tailor Shop,Building,Antique Shop,Bagel Shop,Grilled Meat Restaurant,Sri Lankan Restaurant,Tunnel,latte,density
0,Abbotsbury,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,False,416.967944
1,Alexandria,0,0,0,0,1,1,1,5,1,...,0,0,0,0,0,0,0,0,True,1198.657413
2,Alfords Point,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,False,593.804793
3,Allambie Heights,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,True,509.993108
4,Ambarvale,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,False,1248.351953


In [274]:
# Rescale every venue count to count * 1000 / density; rows with zero density stay unchanged
density_idx = df_labeled_venue_set.columns.get_loc('density')
venue_cols = [c for c in df_labeled_venue_set.columns if c not in ('Suburb', 'latte', 'density')]
for i in range(len(df_labeled_venue_set)):
    density = float(df_labeled_venue_set.iat[i, density_idx])
    if density > 0:
        for col_name in venue_cols:
            col_idx = df_labeled_venue_set.columns.get_loc(col_name)
            new_val = float(df_labeled_venue_set.iat[i, col_idx]) * 1000 / density
            df_labeled_venue_set.iat[i, col_idx] = new_val


In [275]:
df_labeled_venue_set.head()

Unnamed: 0,Suburb,Bus Station,Convenience Store,Park,Supermarket,Australian Restaurant,Bar,Basketball Court,Café,Flea Market,...,Port,Tailor Shop,Building,Antique Shop,Bagel Shop,Grilled Meat Restaurant,Sri Lankan Restaurant,Tunnel,latte,density
0,Abbotsbury,2,2,2,2,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,False,416.967944
1,Alexandria,0,0,0,0,0.834267,0.834267,0.834267,4.17133,0.834267,...,0,0,0,0,0,0,0,0,True,1198.657413
2,Alfords Point,0,0,0,0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,False,593.804793
3,Allambie Heights,0,0,0,0,0.0,0.0,0.0,1.96081,0.0,...,0,0,0,0,0,0,0,0,True,509.993108
4,Ambarvale,0,0,0,0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,False,1248.351953
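
In the output above, Abbotsbury's four counts of 1 print as 2, although 1 × 1000 / 416.97 is about 2.40, and Ambarvale's Supermarket count of 1 prints as 0 rather than about 0.80. This suggests that some columns were stored as integers and the scaled values were truncated. A minimal vectorised sketch that casts to float first is shown below; the toy frame and its values are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for df_labeled_venue_set (illustrative values only)
df = pd.DataFrame({
    'Suburb': ['A', 'B', 'C'],
    'Park': [1, 0, 2],
    'Café': [0, 5, 1],
    'latte': [False, True, True],
    'density': [500.0, 1000.0, 0.0],
})

venue_cols = df.columns.difference(['Suburb', 'latte', 'density'])

# Cast counts to float so that the scaled values are not truncated to integers
df[venue_cols] = df[venue_cols].astype(float)

# Scale only the rows with a positive density, as the loop above does
has_density = df['density'] > 0
df.loc[has_density, venue_cols] = (
    df.loc[has_density, venue_cols]
    .mul(1000)
    .div(df.loc[has_density, 'density'], axis=0)
)
print(df)
```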


In [276]:
%%markdown
Now let's prepare the data sets for machine learning.

Now let's prepare the data sets for machine learning.


In [277]:
df_labeled_venue_set.head()

Unnamed: 0,Suburb,Bus Station,Convenience Store,Park,Supermarket,Australian Restaurant,Bar,Basketball Court,Café,Flea Market,...,Port,Tailor Shop,Building,Antique Shop,Bagel Shop,Grilled Meat Restaurant,Sri Lankan Restaurant,Tunnel,latte,density
0,Abbotsbury,2,2,2,2,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,False,416.967944
1,Alexandria,0,0,0,0,0.834267,0.834267,0.834267,4.17133,0.834267,...,0,0,0,0,0,0,0,0,True,1198.657413
2,Alfords Point,0,0,0,0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,False,593.804793
3,Allambie Heights,0,0,0,0,0.0,0.0,0.0,1.96081,0.0,...,0,0,0,0,0,0,0,0,True,509.993108
4,Ambarvale,0,0,0,0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,False,1248.351953


In [278]:
%%markdown
Creating X, the independent variables, and y, the dependent variable

Creating X, the independent variables, and y, the dependent variable


In [279]:
X = df_labeled_venue_set.copy()
del X['Suburb']
del X['latte']
del X['density']
df_labeled_venue_set["latte"] = pd.to_numeric(df_labeled_venue_set["latte"])
y = df_labeled_venue_set['latte'].values

In [280]:
%%markdown
Normalizing the dataset

Normalizing the dataset


In [281]:
from sklearn import preprocessing
X= preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]


array([[ 4.68537981,  1.26087956, -0.01574635, ..., -0.0469841 ,
        -0.0469841 , -0.0469841 ],
       [-0.10552657, -0.07815979, -0.07929162, ..., -0.0469841 ,
        -0.0469841 , -0.0469841 ],
       [-0.10552657, -0.07815979, -0.07929162, ..., -0.0469841 ,
        -0.0469841 , -0.0469841 ],
       [-0.10552657, -0.07815979, -0.07929162, ..., -0.0469841 ,
        -0.0469841 , -0.0469841 ],
       [-0.10552657, -0.07815979, -0.07929162, ..., -0.0469841 ,
        -0.0469841 , -0.0469841 ]])

In [283]:
%%markdown
Split the data into training and test sets

Split the data into training and test sets


In [308]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.25, random_state=3)
print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)

Train set: (340, 313) (340,)
Test set: (114, 313) (114,)
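
Note that the `StandardScaler` above is fitted on all of X before the split, so statistics from the test rows leak into the training features. One common way to avoid this, sketched here on synthetic data, is to split first and wrap the scaler and classifier in a pipeline. The arrays `X_raw` and `y_demo` are made-up stand-ins, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_raw = rng.poisson(0.3, size=(200, 20)).astype(float)  # synthetic sparse counts
y_demo = rng.integers(0, 2, size=200).astype(bool)      # synthetic labels

# Split the unscaled data first
X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y_demo, test_size=0.25, random_state=3)

# The pipeline fits the scaler on the training rows only,
# then applies the same transformation to the test rows when scoring
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=8))
model.fit(X_tr, y_tr)
print("Test accuracy:", model.score(X_te, y_te))
```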


In [309]:
X_train[0:5]

array([[ 2.28992662, -0.07815979, -0.07929162, ..., -0.0469841 ,
        -0.0469841 , -0.0469841 ],
       [-0.10552657, -0.07815979, -0.07929162, ..., -0.0469841 ,
        -0.0469841 , -0.0469841 ],
       [-0.10552657, -0.07815979, -0.07929162, ..., -0.0469841 ,
        -0.0469841 , -0.0469841 ],
       [-0.10552657, -0.07815979, -0.07929162, ..., -0.0469841 ,
        -0.0469841 , -0.0469841 ],
       [-0.10552657, -0.07815979, -0.07929162, ..., -0.0469841 ,
        -0.0469841 , -0.0469841 ]])

In [310]:
y[0:5]

array([False,  True, False,  True, False])

In [311]:
import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from matplotlib.ticker import NullFormatter
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier

In [312]:
%%markdown
### Training with KNeighborsClassifier

### Training with KNeighborsClassifier


In [313]:
#from sklearn.neighbors import KNeighborsClassifier
k = 8
#Train Model and Predict  
neigh = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)
neigh

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=8, p=2,
           weights='uniform')

In [314]:
from sklearn import metrics

# Predict on the held-out test set, then compare training and test accuracy
yhat = neigh.predict(X_test)
print("Train set Accuracy: ", metrics.accuracy_score(y_train, neigh.predict(X_train)))
print("Test set Accuracy: ", metrics.accuracy_score(y_test, yhat))

Train set Accuracy:  0.6411764705882353
Test set Accuracy:  0.47368421052631576
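
With only 114 test suburbs, a single accuracy figure carries a sizeable sampling error. The arithmetic below is a rough sketch using the normal approximation to the binomial (an assumption on my part), and computes the same standard error that the next cell stores in `std_acc`.

```python
import math

n_test = 114            # test-set size printed by the split above
test_acc = 0.47368421   # test accuracy for k = 8 printed above

# Standard error of a proportion: sqrt(p * (1 - p) / n)
se = math.sqrt(test_acc * (1 - test_acc) / n_test)

# Approximate 95% interval under the normal approximation
ci_low, ci_high = test_acc - 1.96 * se, test_acc + 1.96 * se
print(f"SE = {se:.3f}, approx. 95% interval = [{ci_low:.3f}, {ci_high:.3f}]")
```

The interval comfortably contains 0.5, so this score cannot be distinguished from coin flipping.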


In [315]:
Ks = 15
mean_acc = np.zeros(Ks - 1)
std_acc = np.zeros(Ks - 1)
for n in range(1, Ks):
    # Train a model with n neighbours and predict on the test set
    neigh = KNeighborsClassifier(n_neighbors=n).fit(X_train, y_train)
    yhat = neigh.predict(X_test)
    # Test accuracy and its standard error for this n
    mean_acc[n - 1] = np.mean(yhat == y_test)
    std_acc[n - 1] = np.std(yhat == y_test) / np.sqrt(yhat.shape[0])

mean_acc

array([0.53508772, 0.44736842, 0.57894737, 0.48245614, 0.55263158,
       0.51754386, 0.54385965, 0.47368421, 0.51754386, 0.5       ,
       0.5877193 , 0.51754386, 0.57894737, 0.52631579])
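
Picking the best k by its test accuracy, as above, makes the reported score optimistic, because the test set has then been used for tuning. A hedged alternative is to choose k by cross-validation on the training data only. The sketch below runs on synthetic arrays (`X_demo` and `y_demo` are made up), so its numbers say nothing about the Sydney data.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.poisson(0.3, size=(200, 20)).astype(float)  # synthetic features
y_demo = rng.integers(0, 2, size=200).astype(bool)       # synthetic labels

cv_scores = {}
for k in range(1, 15):
    # Scaling sits inside the pipeline, so it is refitted on each training fold
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
print("best k:", best_k, "mean CV accuracy:", round(cv_scores[best_k], 3))
```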

In [316]:
%%markdown
Our best score, 0.5877 (at k = 11), is only slightly better than random labelling, so KNN failed.

### Test with DecisionTreeClassifier

Our best score, 0.5877 (at k = 11), is only slightly better than random labelling, so KNN failed.

### Test with DecisionTreeClassifier


In [317]:
from sklearn.tree import DecisionTreeClassifier
dTree = DecisionTreeClassifier(criterion="entropy", max_depth = 4)
dTree # it shows the default parameters

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=4,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [318]:
dTree.fit(X_train,y_train)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=4,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [319]:
pTree = dTree.predict(X_test)
print (pTree [0:5])
print (y_test [0:5])

[ True  True  True  True  True]
[False False  True  True False]


In [320]:
from sklearn import metrics
print("DecisionTree's Accuracy: ", metrics.accuracy_score(y_test, pTree))


DecisionTree's Accuracy:  0.631578947368421


In [321]:
%%markdown
Another poor result: the decision tree reaches only 0.63 accuracy.

### Next is SVM

Another poor result: the decision tree reaches only 0.63 accuracy.

### Next is SVM


In [322]:
from sklearn import svm
clf = svm.SVC()
clf.fit(X_train, y_train) 



SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

In [323]:
yhat = clf.predict(X_test)
yhat

array([ True,  True,  True,  True,  True,  True,  True,  True, False,
        True,  True,  True, False,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True, False,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,  True,  True,  True,  True])

In [324]:
print("SVM's Accuracy: ", metrics.accuracy_score(y_test, yhat))

SVM's Accuracy:  0.6140350877192983
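
The SVM above predicts True for 110 of the 114 test suburbs, so its score is close to what always guessing the majority class would achieve. A majority-class baseline makes that comparison explicit; the sketch below uses scikit-learn's `DummyClassifier` on synthetic labels (`X_demo` and `y_demo` are made up, with roughly 60% True as an assumption).

```python
import numpy as np
from sklearn.dummy import DummyClassifier

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))   # features are ignored by the dummy model
y_demo = rng.random(200) < 0.6       # synthetic labels, roughly 60% True

# Always predict the most frequent class seen in the training rows
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo[:150], y_demo[:150])
baseline_pred = baseline.predict(X_demo[150:])
baseline_acc = baseline.score(X_demo[150:], y_demo[150:])
print("Baseline accuracy:", baseline_acc)
```

Any real model should beat this number before its accuracy is read as evidence for the latte line.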


In [325]:
%%markdown
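SVM failed as well: it predicts True for 110 of the 114 test suburbs, so its accuracy of 0.61 largely reflects the class balance of the test set.

### LogisticRegression

SVM failed as well: it predicts True for 110 of the 114 test suburbs, so its accuracy of 0.61 largely reflects the class balance of the test set.

### LogisticRegression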


In [326]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train,y_train)
LR

LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)

In [327]:
yhatl = LR.predict(X_test)
yhatl

array([False, False,  True,  True, False, False, False, False, False,
        True,  True, False, False, False,  True,  True,  True,  True,
       False, False,  True,  True, False,  True,  True,  True,  True,
        True,  True,  True, False,  True, False,  True,  True, False,
       False, False, False, False, False, False,  True,  True,  True,
       False, False,  True,  True,  True,  True,  True, False, False,
        True, False,  True,  True,  True, False,  True,  True,  True,
        True,  True, False,  True,  True,  True, False, False,  True,
       False,  True, False, False, False,  True,  True,  True, False,
       False, False,  True,  True, False, False,  True,  True,  True,
       False, False, False, False, False,  True, False, False,  True,
        True, False,  True,  True,  True, False, False, False, False,
        True,  True, False, False, False,  True])
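
The `confusion_matrix` import above is never used, although it would show which class the errors fall on, something a single accuracy figure hides. A minimal sketch on made-up labels (`y_true_demo` and `y_pred_demo` are illustrative, not the notebook's predictions):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only
y_true_demo = np.array([False, False, True, True, False, True])
y_pred_demo = np.array([False, True, True, True, False, False])

# Rows are true classes, columns are predicted classes, both ordered False, True
cm = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm.ravel()
print(cm)
print("true negatives:", tn, "false positives:", fp,
      "false negatives:", fn, "true positives:", tp)
```

On the real data, the call would be `confusion_matrix(y_test, yhatl)`.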

In [328]:
print("LR's Accuracy: ", metrics.accuracy_score(y_test, yhatl))

LR's Accuracy:  0.5350877192982456


In [329]:
%%markdown
Logistic regression also reached only low accuracy (0.535).

## Conclusion

Logistic regression also reached only low accuracy (0.535).

## Conclusion


In [330]:
%%markdown

Foursquare data, at least with the approach taken, could not confirm a major lifestyle difference between the two sides of the 'latte line'.

Potential reasons:
    - The data was too sparse for the chosen geographical granularity. At local government level (where each unit comprises 5-10 suburbs), Foursquare data might have been more coherent.
    - In reality the latte line is more of a zig-zag, with pockets of prosperous and less desirable zones on both sides.
    - Perhaps the latte line is not a matter of lifestyle, or the venues (or their presence on Foursquare) do not determine lifestyle.
    - The difference may exist but be too subtle for our simple exercise; combining further data sets would be advantageous.


Due to Watson free tier limitations the story ends here.


Foursquare data, at least with the approach taken, could not confirm a major lifestyle difference between the two sides of the 'latte line'.

Potential reasons:
    - The data was too sparse for the chosen geographical granularity. At local government level (where each unit comprises 5-10 suburbs), Foursquare data might have been more coherent.
    - In reality the latte line is more of a zig-zag, with pockets of prosperous and less desirable zones on both sides.
    - Perhaps the latte line is not a matter of lifestyle, or the venues (or their presence on Foursquare) do not determine lifestyle.
    - The difference may exist but be too subtle for our simple exercise; combining further data sets would be advantageous.


Due to Watson free tier limitations the story ends here.
