# Battle of the Neighborhoods: Introduction & Business Problem Sections

## 1. Introduction:

### The Business Problem:

In this project we follow a fictional character named Jimmy, who is looking to open a new gym somewhere in Toronto. Jimmy begins researching potential locations, only to discover that Toronto is enormous. In fact, there are 140 different neighborhoods in Toronto! Jimmy is shocked, but he knows he will have to choose one of these neighborhoods for his new business. But which one should he choose, and how can he narrow down his options to find the best one?

Jimmy suspects that the best neighborhood for his new gym will have the following characteristics:

* **High income:** This makes sense because people will presumably be able to purchase more gym memberships if they have more disposable income.
* **High population:** A higher population means more people to go to the gym.
* **Large area:** The bigger the area in square kilometers, the more room there is for a gym.
* **A low number of already existing gyms:** If there aren't very many gyms, it suggests the demand for gyms hasn't yet been satisfied.

But these are just Jimmy's suspicions. In this research project, we are going to collect actual data and test these suspicions to find out how accurate they are.

It is going to be our job to look through the attributes of each Toronto neighborhood in order to locate the best neighborhood for Jimmy's new business. Let's get started!

### Who Would Be Interested in This Project?

**Business people:** Business people who are looking to open a new business would be very interested in this project. It is useful to them because it uses data to show where their business is most likely to succeed.

**Investors:** Investors are people who provide funds to business people so that they can purchase the equipment and materials needed to get the business running. Investors hope that the business is successful so that they can make a healthy return on their investment. Data science projects like this one are useful to investors for the same reason they are useful to business people: they can help identify the best locations for investment.

**Academics:** Academics and researchers are interested in learning more about the 'why' and the 'what' of different things. They would want to learn what the best locations are, why these locations are the best, and why this information is useful. Academics would then publish their research once they have answers to these questions.

**Data Scientists & Analysts:** Data scientists are the individuals being hired by the business people and investors to complete the project. They specialize in data analytics, statistics, and programming.

## 2. Data Acquisition and Cleaning

For this section we will be using the following data source:

### Neighbourhood Profiles 2016:

This dataset was obtained through the official government website of Toronto, *Toronto.ca*. The dataset is created once every five years from the demographic data collected through Canada's Census of Population. It includes a vast collection of data on the age, gender, race, location, income, language spoken, national origin, and many other useful characteristics of Toronto's residents. The 2016 dataset will be used for this project.

More information about the dataset can be found at the following links:  
https://www.toronto.ca/city-government/data-research-maps/neighbourhoods-communities/neighbourhood-profiles/  
https://open.toronto.ca/dataset/neighbourhood-profiles/

Importantly, the dataset contains the above features for each of Toronto's 140 neighborhoods. It provides us with a useful way to compare and contrast these neighborhoods as we analyze their similarities and differences in depth.

Overall, the data includes a stunning 2,383 features, which is obviously far too many to analyze in a data science project. In this project we will have to choose just a few features to make sure that our project is manageable and focused. Below are some features in the dataset that could be interesting as well as the rationale for including them:

### Best Dataset Features:

#### We will definitely include these ones in our analysis:
* **Population, 2016:** A higher population should lead to more gyms.
* **Total private dwellings:** More dwellings mean more households that could purchase gym memberships.
* **Land area in square kilometres:** A greater land area means more room for gym locations.
* **Children (0-14 years):** More children could mean more gyms because these children will become adults within a few years and may then be interested in going to the gym.
* **Youth (15-24 years):** Youth are especially interested in going to the gym and staying fit.
* **Working Age (25-54 years):** A higher population of these individuals means more gyms.
* **Pre-retirement (55-64 years):** A higher population of these individuals means more gyms.
* **Seniors (65+ years):** A higher population of these individuals means more gyms.
* **Older Seniors (85+ years):** A higher population of these individuals means more gyms.
* **Employment rate:** Higher employment means higher income, which means more money for gym memberships.
* **Unemployment rate:** Higher unemployment should reduce income, leading to fewer gyms.
* **Income taxes: Average amount (\$):** Income taxes lower disposable income available to spend on gym memberships.
* **After-tax income: Average amount (\$):** More income to spend on gym memberships.

#### We will possibly include these ones in our analysis:
* **No certificate, diploma or degree:** Maybe less educated individuals are less likely to go to the gym?
* **Secondary (high) school diploma or equivalency certificate:** How likely to go to the gym are moderately educated individuals?
* **University certificate, diploma or degree at bachelor level or above:** How likely to go to the gym are highly educated individuals?

### Foursquare API:

The other major data source we will be using for this project is the Foursquare API. We will primarily be using this API to count the number of gyms that exist in each neighborhood. The number of gyms in each neighborhood will be our dependent variable, the one we are trying to predict.
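As a rough sketch of the counting step, the snippet below tallies unique gym venues per neighborhood from already-parsed API responses. No request is sent here. The response layout (`response` containing a `venues` list, each venue with an `id`), the function name `count_gyms`, and the sample data are all assumptions for illustration, and the real payload should be checked against the Foursquare documentation.

```python
def count_gyms(responses_by_neighborhood):
    """Count unique venues per neighborhood.

    `responses_by_neighborhood` maps a neighborhood name to a list of
    parsed JSON responses (dicts). The {"response": {"venues": [...]}}
    layout is an assumption, not a confirmed API contract.
    """
    counts = {}
    for name, responses in responses_by_neighborhood.items():
        seen_ids = set()
        for payload in responses:
            for venue in payload.get("response", {}).get("venues", []):
                # De-duplicate venues returned by more than one query.
                seen_ids.add(venue["id"])
        counts[name] = len(seen_ids)
    return counts


# Made-up example data, for illustration only.
sample = {
    "Neighborhood A": [
        {"response": {"venues": [{"id": "v1"}, {"id": "v2"}]}},
        {"response": {"venues": [{"id": "v2"}, {"id": "v3"}]}},
    ],
    "Neighborhood B": [{"response": {"venues": []}}],
}
gym_counts = count_gyms(sample)
print(gym_counts)
```

De-duplicating by venue id matters if a neighborhood is covered by several overlapping queries, since the same gym could otherwise be counted more than once.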

### Next Steps:

For our next step, we are going to have to clean and wrangle the data. We are going to have to take care of missing values, select the appropriate features, and ensure that the datatypes are correct.
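The cleaning steps described above could look roughly like the following pandas sketch. The tiny table is made up for illustration, and the real file's layout, column names, and number formatting may differ, so treat every name here as an assumption.

```python
import pandas as pd

# Made-up rows for illustration; the real file's layout may differ.
raw = pd.DataFrame({
    "Neighbourhood": ["A", "B", "C"],
    "Population, 2016": ["29,113", "23,757", None],
    "Land area in square kilometres": ["7.41", "3.93", "3.34"],
    "Unused feature": ["x", "y", "z"],
})

selected = ["Population, 2016", "Land area in square kilometres"]

# 1. Select the appropriate features.
clean = raw.set_index("Neighbourhood")[selected].copy()

# 2. Fix the datatypes: strip thousands separators, then convert to numbers.
for col in selected:
    without_commas = clean[col].str.replace(",", "", regex=False)
    clean[col] = pd.to_numeric(without_commas, errors="coerce")

# 3. Handle missing values. Dropping rows is one option; imputing is another.
clean = clean.dropna()

print(clean)
```

Dropping rows is the simplest choice, but with only 140 neighborhoods it may be worth imputing instead so that no neighborhood is lost from the comparison.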

After that, we will perform a multiple linear regression analysis using the Statsmodels library and create a model that can predict the number of gyms for each neighborhood. We will assess the accuracy of our model before comparing the predicted values of our model with the actual values of the neighborhoods. These values are going to be critical to our analysis.

1. The actual values represent the current number of gyms being supplied to the population in each neighborhood.

2. The predicted values represent the estimated total demand for gyms.

3. The predicted values minus the actual values will equal the **demand for gyms that has not yet been supplied.** This is the most critical value we will be looking at. Neighborhoods with high values of this feature will be the locations we recommend to Jimmy.