# DS9-401 Capstone Introduction/Business Problem and Data
# Define the problem and the target audience

#### Import all libraries so the code to illustrate issues works throughout the notebook 

In [8]:
import pandas as pd

## Task
Clearly define a problem or an idea of your choice, where you would need to leverage the Foursquare location data to solve or execute. Remember that data science problems always target an audience and are meant to help a group of stakeholders solve a problem, so make sure that you explicitly describe your audience and why they would care about your problem.

This submission will eventually become your Introduction/Business Problem section in your final report. So I recommend that you push the report (having your Introduction/Business Problem section only for now) to your Github repository and submit a link to it.

## Introduction
Finding a place to rent in Toronto is a painful process in which the availability of a rental unit can change within an hour. As a result, tenants often feel pressured into making quick decisions that they might come to regret. Many factors affect the decision: price, the state of the unit, en-suite equipment, building amenities, number of washrooms, square footage, whether the unit is furnished, roommates (if any), neighbors, proximity to public transit, a garage or other parking facilities, and personal circumstances such as pet ownership, to name a few. These factors can be grouped into four categories: financial, unit-related, location-related, and personal.

Most of the choices for these factors can be stated explicitly and are already embedded in the search functionality of major rental search websites. It is quite simple for you as a tenant to enter how many bedrooms you want, what your budget is, whether you will ever use the condo gym (that is, whether you care about amenities), whether you need the unit furnished, whether you have pets, and so on. The result is a list of available units. The question that goes unnoticed is the location. The net cast is either too wide or too narrow, and rarely right. Yes, you know how long you want to commute, but what if the perfect neighborhood is just 5 minutes further? As you might have guessed, this report tries to address that issue by building a recommended list of neighborhoods.

## Target Audience and Impact Analysis
The stakeholder groups of this report are the rental market participants in Toronto. The problem is especially acute for tenants and rental brokers. It also affects landlords indirectly because, if solved, it could decrease the churn rate.

#### Tenants
Tenants are the main target audience of the report. They are the buyers in the market and have to define their preferences before hitting the search button. The first preference is the location. There are two variants of the problem: being too general with the location preferences and being too specific.

If the tenant is too general, narrowing down the options might take a lot of time, or the decision will simply be made on price, which saves money but potentially makes life less enjoyable. With a shortlist of neighborhoods to choose from, the trial-and-error process of looking for the right location can be streamlined.

On the other hand, if the tenant is too specific, they might miss the perfect unit simply because it is 500 meters outside the designated area. In this case, the shortlist might expand the number of quality units for the tenant to choose from.

#### Rental brokers
A shortlist of neighborhoods for the tenant to choose from can give the broker a competitive advantage in the market. It can make the search process more efficient, increase the broker's capacity to serve more clients, and help close more deals because the recommendations will make intuitive sense to the tenant. Finally, happier customers can spread the word through the grapevine, increasing the number of prospects and making the business more sustainable in the long run.

#### Landlords
If tenants are satisfied with all the factors feeding their decision, they are more likely to rent long-term, barring any major life-altering events (marriage, kids, relocation for work). Re-listing the unit at the new market price might bring an additional 1000 CAD/year, but at the cost of the hassle of showings, background checks, negotiations, and adjusting to new tenants, plus the risk of a month-long vacancy, which would undercut any benefit of the new market price. In addition, happy tenants mean fewer complaints and less hassle in the long term.

## Business Problem
The business problem tackled in this report is the complexity of an implicit but important decision the tenant makes: the target location of the rental property. Many factors can play a role, but the decision is essentially binary: one either enjoys the neighborhood or does not. This project aims to help answer the tenant's question "Will I enjoy this neighborhood?" by providing a recommended shortlist of neighborhoods to explore.

# Describe the Data Used to Solve the Problem

## Task
Describe the data that you will be using to solve the problem or execute your idea. Remember that you will need to use the Foursquare location data to solve the problem or execute your idea. You can absolutely use other datasets in combination with the Foursquare location data. So make sure that you provide adequate explanation and discussion, with examples, of the data that you will be using, even if it is only Foursquare location data.

This submission will eventually become your Data section in your final report. So I recommend that you push the report (having your Data section) to your Github repository and submit a link to it.

## Data
The data used to develop a solution to the issue at hand will be:
1. A list of neighborhoods in Toronto grouped by the first three characters of the postal code (format A1A), scraped from a Wikipedia page and enriched with their respective location coordinates. A description of the dataset and an example of several observations are shown below.

In [11]:
#read 'tor.csv' stored from previous notebook into pandas DataFrame 'df_tor'

with open('tor.csv') as tor:
    df_tor = pd.read_csv(tor, index_col='Unnamed: 0') #set indexcolumn from csv

print('The dataframe has {} boroughs and {} neighborhood groups by postal codes.'.format(
        len(df_tor['Borough'].unique()),
        df_tor.shape[0]
    )
)    
    
df_tor.head()

The dataframe has 11 boroughs and 103 neighborhood groups by postal codes.


| Unnamed: 0 | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor | 43.718518 | -79.464763 |
| 4 | M7A | Queen's Park | Queen's Park | 43.662301 | -79.389494 |
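
The grouping key described in item 1 can be sanity-checked with a short sketch. It rebuilds a few of the sample rows from the table above by hand instead of reading `tor.csv`, checks that each postal code prefix follows the letter-digit-letter (A1A) pattern, and counts the boroughs in the sample. The names `sample`, `FSA_PATTERN`, and `valid` are introduced here for illustration only and are not part of the original notebook.

```python
import pandas as pd

# Hand-copied from the sample rows above; the notebook itself reads tor.csv.
sample = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M5A", "M6A", "M7A"],
    "Borough": ["North York", "North York", "Downtown Toronto",
                "North York", "Queen's Park"],
    "Latitude": [43.753259, 43.725882, 43.65426, 43.718518, 43.662301],
    "Longitude": [-79.329656, -79.315572, -79.360636, -79.464763, -79.389494],
})

# Assumed pattern for the 'A1A' prefix: one letter, one digit, one letter.
FSA_PATTERN = r"^[A-Z]\d[A-Z]$"
valid = sample["PostalCode"].str.match(FSA_PATTERN)

print("All prefixes match A1A:", bool(valid.all()))
print("Boroughs in the sample:", sample["Borough"].nunique())
```

A check like this is cheap to run on the full table before using the postal code as a grouping or join key.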


2. A list of up to 150 of the most popular venues within 1500 m of the center of a given neighborhood group, obtained from Foursquare. How the data was gathered is explained in the methodology section. A description of the dataset and an example of several observations are shown below.

In [12]:
#read 'tor_venues.csv' stored from previous notebook into pandas DataFrame 'tor_venues'

with open('tor_venues.csv') as tor_v:
    tor_venues = pd.read_csv(tor_v, index_col='Unnamed: 0') #set indexcolumn from csv

print('The dataframe has {} groups of neighborhoods by postal code, {} unique venue categories, and {} venues.'.format(
        len(tor_venues['Neighborhood'].unique()),
        len(tor_venues['Venue Category'].unique()),
        tor_venues.shape[0]
    )
)    
    
tor_venues.head()

The dataframe has 103 groups of neighborhoods by postal code, 339 unique venue categories, and 6768 venues.


| Unnamed: 0 | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Parkwoods | 43.753259 | -79.329656 | Allwyn's Bakery | 43.75984 | -79.324719 | Caribbean Restaurant |
| 1 | Parkwoods | 43.753259 | -79.329656 | Brookbanks Park | 43.751976 | -79.33214 | Park |
| 2 | Parkwoods | 43.753259 | -79.329656 | Donalda Golf & Country Club | 43.752816 | -79.342741 | Golf Course |
| 3 | Parkwoods | 43.753259 | -79.329656 | Tim Hortons | 43.760668 | -79.326368 | Café |
| 4 | Parkwoods | 43.753259 | -79.329656 | LCBO | 43.757774 | -79.314257 | Liquor Store |
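
The 1500 m vicinity mentioned in item 2 can be illustrated with a minimal sketch that computes the great-circle (haversine) distance from the Parkwoods center to each of the five sample venues above. Whether Foursquare measures its radius in exactly this way is an assumption; the sketch only checks that the sample rows are consistent with a 1500 m radius. The names `haversine_m`, `RADIUS_M`, `center`, `venues`, and `distances` are hypothetical helpers, not part of the original notebook.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (common approximation)
RADIUS_M = 1500             # search radius stated in the text

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# Coordinates copied from the sample rows above.
center = (43.753259, -79.329656)  # Parkwoods
venues = {
    "Allwyn's Bakery": (43.75984, -79.324719),
    "Brookbanks Park": (43.751976, -79.33214),
    "Donalda Golf & Country Club": (43.752816, -79.342741),
    "Tim Hortons": (43.760668, -79.326368),
    "LCBO": (43.757774, -79.314257),
}

distances = {name: haversine_m(*center, lat, lon)
             for name, (lat, lon) in venues.items()}
for name, d in distances.items():
    print(f"{name}: {d:.0f} m, within radius: {d <= RADIUS_M}")
```

The haversine formula is used here because it needs only the standard library; at city scale a planar approximation would give nearly the same distances.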


3. An optional list of neighborhood groups that the prospective tenant enjoys or does not enjoy for their "vibe" (not for commute time or other logistical considerations, which the user can define explicitly later, once the shortlist is provided). If this optional list is not provided, the renter will be given a clustered list of neighborhoods to explore and can choose a preferred cluster.
A description of the dataset and an example of several observations are shown below. Data points labeled '1' are neighborhoods the user enjoys, and those labeled '0' are neighborhoods the user does not enjoy.

In [14]:
#read 'pref.csv' defined by user

with open('pref.csv') as prefer:
    pref = pd.read_csv(prefer, index_col='Unnamed: 0') #set indexcolumn from csv

print('The prospective tenant has defined {} neighborhood preferences.'.format(
        pref.shape[0]
    )
)    
    
pref.head()

The prospective tenant has defined 5 neighborhood preferences.


| Unnamed: 0 | PostalCode | Neighborhood | Label |
|---|---|---|---|
| 0 | M6G | Christie | 0 |
| 1 | M9P | Westmount | 0 |
| 2 | M5R | The Annex, North Midtown, Yorkville | 1 |
| 3 | M5T | Chinatown, Grange Park, Kensington Market | 0 |
| 4 | M5V | CN Tower, Bathurst Quay, Island airport, Harbo... | 1 |
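
As a small illustration of the labeling convention in item 3, the sketch below splits the five sample preferences above into enjoyed and not-enjoyed postal codes. It rebuilds the sample by hand using only the `PostalCode` and `Label` columns; the names `pref_sample`, `liked`, and `disliked` are hypothetical, and how these labels feed into the recommendation is left to the methodology section.

```python
import pandas as pd

# Hand-copied from the sample rows above; the notebook itself reads pref.csv.
pref_sample = pd.DataFrame({
    "PostalCode": ["M6G", "M9P", "M5R", "M5T", "M5V"],
    "Label": [0, 0, 1, 0, 1],  # 1 = enjoys the neighborhood, 0 = does not
})

# Split the postal codes by label, preserving the original row order.
liked = pref_sample.loc[pref_sample["Label"] == 1, "PostalCode"].tolist()
disliked = pref_sample.loc[pref_sample["Label"] == 0, "PostalCode"].tolist()

print("Enjoyed:", liked)
print("Not enjoyed:", disliked)
```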
