# The Battle of the Neighborhoods Full Report

## 1. Introduction:

### The Business Problem:

In this project we create a fictional character named Jimmy who is looking to open a new gym somewhere in Toronto. Jimmy begins researching potential locations only to discover that Toronto is enormous. In fact, there are 140 different neighborhoods in Toronto! Jimmy is shocked, but he knows he will have to choose one of these neighborhoods for his new business. But which one should he choose, and how is he going to narrow down his options to find the best one?

Jimmy suspects that the best neighborhood for his new gym will have the following characteristics:

* **High income:** This makes sense because people will presumably be able to purchase more gym memberships if they have more disposable income.
* **High population:** A higher population means more people to go to the gym.
* **Large area:** The bigger the area in terms of square kilometers the more room there is for a gym.
* **A low number of existing gyms:** If there aren't very many gyms, the demand for gyms may not yet have been satisfied.

But these are just Jimmy's suspicions. In this research project, we are going to collect actual data and test these suspicions to find out how accurate they truly are.

It is going to be our job to look through the attributes of each Toronto neighborhood in order to locate the best neighborhood for Jimmy's new business. Let's get started!

### Who Would Be Interested in This Project?

**Business people:** Business people that are looking to open a new business would be very interested in this project. This project is useful to them because it uses data to show them where their business is most likely to succeed.

**Investors:** Investors are people that provide funds to business people so that they can purchase the necessary equipment and materials to get the business running. Investors hope that the business is successful so that they can make a healthy return on their investment. Data science projects like this one are useful to investors for the same reason they are useful to business people. They can help investors locate the best locations for investment.

**Academics:** Academics and researchers are interested in learning more about the 'why' and the 'what' of different things. They would want to learn what the best locations are, why these locations are the best, and why this information is useful. Academics would then publish their research once they have answers to these questions.

**Data Scientists & Analysts:** Data scientists are the individuals being hired by the business people and investors to complete the project. They specialize in data analytics, statistics, and programming.

## 2. Data Acquisition and Cleaning:

For this section we will be using the following data source:

### Neighbourhood Profiles 2016:

This dataset was obtained through the official government website of Toronto, *Toronto.ca*. It is compiled once every five years from the demographic data collected through Canada's Census of Population. The dataset includes a vast collection of data on the age, gender, race, location, income, language spoken, national origin, and many other useful characteristics of Toronto's residents. The 2016 dataset will be used for this project.

More information about the dataset can be found at the following links:  
https://www.toronto.ca/city-government/data-research-maps/neighbourhoods-communities/neighbourhood-profiles/  
https://open.toronto.ca/dataset/neighbourhood-profiles/

Importantly, the dataset contains the above features for each of Toronto's 140 neighborhoods. It provides us with a useful way to compare and contrast these neighborhoods as we analyze their similarities and differences in depth.

Overall, the data includes a stunning 2,383 features, which is obviously far too many to analyze in a data science project. In this project we will have to choose just a few features to make sure that our project is manageable and focused. Below are some features in the dataset that could be interesting as well as the rationale for including them:

### Best Dataset Features:

#### We will definitely include these ones in our analysis:
* **Population, 2016:** A higher population should lead to more gyms.
* **Total private dwellings:** More dwellings mean more households that could purchase gym memberships.
* **Land area in square kilometres:** A greater land area means more room for gym locations.
* **Children (0-14 years):** More children could mean more gyms because these children will become adults within a few years and may then be interested in going to the gym.
* **Youth (15-24 years):** Youth are especially interested in going to the gym and staying fit.
* **Working Age (25-54 years):** A higher population of these individuals means more gyms.
* **Pre-retirement (55-64 years):** A higher population of these individuals means more gyms.
* **Seniors (65+ years):** A higher population of these individuals means more gyms.
* **Older Seniors (85+ years):** A higher population of these individuals means more gyms.
* **Employment rate:** Higher employment means higher income, which means more money for gym memberships.
* **Unemployment rate:** Higher unemployment should reduce income, leading to fewer gyms.
* **Income taxes: Average amount (\$):** Income taxes lower disposable income available to spend on gym memberships.
* **After-tax income: Average amount (\$):** More income to spend on gym memberships.

#### We will possibly include these ones in our analysis:
* **No certificate, diploma or degree:** Maybe less educated individuals are less likely to go to the gym?
* **Secondary (high) school diploma or equivalency certificate:** How likely to go to the gym are moderately educated individuals?
* **University certificate, diploma or degree at bachelor level or above:** How likely to go to the gym are highly educated individuals?

### Foursquare API:

The other major data source we will be using for this project is the Foursquare API. We will primarily be using this API to count the number of gyms that exist in each neighborhood. The number of gyms in each neighborhood will be our dependent variable, the one we are trying to predict.
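As a rough illustration of the counting step, the sketch below tallies gym-like venues in an already-parsed API response. The helper `count_gyms`, the keyword list, and the sample payload are hypothetical; the payload shape (`response` → `venues` → `categories` → `name`) is an assumption modelled on a venue-search style response, not output captured from this project.

```python
def count_gyms(payload, keywords=("gym", "fitness")):
    """Count venues whose category name mentions a gym-related keyword.

    `payload` is assumed to be a parsed JSON response shaped like
    {"response": {"venues": [{"categories": [{"name": ...}]}]}}.
    """
    venues = payload.get("response", {}).get("venues", [])
    count = 0
    for venue in venues:
        names = [c.get("name", "").lower() for c in venue.get("categories", [])]
        if any(keyword in name for name in names for keyword in keywords):
            count += 1
    return count


# Made-up payload for illustration only; not real API output.
sample_payload = {
    "response": {
        "venues": [
            {"name": "Venue A", "categories": [{"name": "Gym / Fitness Center"}]},
            {"name": "Venue B", "categories": [{"name": "Coffee Shop"}]},
            {"name": "Venue C", "categories": [{"name": "Gym"}]},
        ]
    }
}

gym_count = count_gyms(sample_payload)
print(gym_count)
```

In practice one such count would be produced per neighborhood, using that neighborhood's latitude and longitude for the search.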

### Next Steps:

For our next step, we are going to have to clean and wrangle the data. We are going to have to take care of missing values, select the appropriate features, and ensure that the datatypes are correct.
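The cleaning steps described above might look something like the following pandas sketch. The rows, the `clean` helper, and the choice to drop (rather than impute) incomplete rows are assumptions made for illustration; the real dataset may need different handling.

```python
import pandas as pd

# Made-up rows standing in for the raw data; values are illustrative only.
raw = pd.DataFrame({
    "Neighbourhood": ["A", "B", "C", "D"],
    "Population, 2016": ["12,500", "8,300", None, "21,000"],
    "Land area in square kilometres": ["3.2", "n/a", "1.9", "7.5"],
})


def clean(df, id_col="Neighbourhood"):
    """Convert feature columns to numbers and drop incomplete rows."""
    out = df.copy()
    for col in out.columns.drop(id_col):
        # Strip thousands separators, then coerce non-numeric text to NaN.
        text = out[col].astype(str).str.replace(",", "", regex=False)
        out[col] = pd.to_numeric(text, errors="coerce")
    # Drop neighborhoods with any missing value (imputation is an alternative).
    return out.dropna().reset_index(drop=True)


cleaned = clean(raw)
print(cleaned.dtypes)
print(cleaned)
```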

After that, we will perform a multiple linear regression analysis using the Statsmodels library and create a model that can predict the number of gyms for each neighborhood. We will assess the accuracy of our model before comparing the predicted values of our model with the actual values of the neighborhoods. These values are going to be critical to our analysis.

1. The actual values represent the current number of gyms being supplied to the population in each neighborhood.

2. The predicted values represent the estimated total demand for gyms.

3. The predicted values minus the actual values will equal the **demand for gyms that has not yet been supplied.** This is the most critical value we will be looking at. Neighborhoods with high values of this feature will be the locations we recommend to Jimmy.
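The three quantities above can be sketched in a few lines of pandas. The neighborhood labels, the counts, and the column name `unmet_demand` are made up for illustration and are not the report's results.

```python
import pandas as pd

# Made-up counts for illustration; not the report's results.
gyms = pd.DataFrame({
    "Neighbourhood": ["A", "B", "C", "D"],
    "actual": [2, 10, 5, 1],             # gyms currently supplied
    "predicted": [6.5, 7.0, 5.5, 0.5],   # estimated total demand
})

# Demand that has not yet been supplied: predicted minus actual.
gyms["unmet_demand"] = gyms["predicted"] - gyms["actual"]

# Highest unmet demand first; these would be the recommended locations.
ranked = gyms.sort_values("unmet_demand", ascending=False).reset_index(drop=True)
print(ranked)
```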

## 3. Exploratory Data Analysis:

We began our exploratory data analysis by using the pandas library to explore the descriptive statistics of our dataset. We gained some important insights about the minimum value, 25th percentile, median, 75th percentile, and maximum value of the different features of our dataset.
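A minimal sketch of this step, using a small made-up dataframe in place of the cleaned Toronto data (the column names and values are assumptions):

```python
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({
    "population": [10000, 20000, 30000, 40000, 50000],
    "gyms": [1, 3, 2, 8, 6],
})

# describe() reports count, mean, std, min, quartiles, and max per column;
# here we keep only the five-number summary discussed in the text.
summary = df.describe().loc[["min", "25%", "50%", "75%", "max"]]
print(summary)
```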

We decided to predict the number of gyms located in each neighborhood, which is a supervised machine learning task. Because the number of gyms is a numerical value rather than a categorical one, we decided to perform a multiple linear regression analysis. We labeled the number of gyms as the dependent variable and all other variables as the independent variables.

In preparation for our regression analysis, we created a correlation matrix between our dependent and independent variables. We found that some independent variables were more closely related to the dependent variable than others, which made them more valuable in our upcoming regression analysis. We removed the independent variables that had low correlations with the dependent variable. We also removed features that didn't make logical sense or that we couldn't use to tell a "compelling story".
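The correlation screen can be sketched as follows. The data are made up, and the cutoff of 0.3 is a hypothetical threshold chosen for the example; the report does not state the cutoff that was actually used.

```python
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame({
    "gyms":       [1, 2, 3, 4, 5, 6],
    "population": [10, 22, 29, 41, 52, 58],  # moves closely with gyms
    "noise":      [3, 1, 1, 3, 3, 1],        # barely related to gyms
})

target = "gyms"
threshold = 0.3  # hypothetical cutoff, not taken from the report

# Correlation of every feature with the dependent variable.
corr_with_target = df.corr()[target].drop(target)

# Keep features whose absolute correlation meets the cutoff.
kept = corr_with_target[corr_with_target.abs() >= threshold].index.tolist()
print(corr_with_target)
print(kept)
```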

We then performed our regression analysis using the statsmodels library. We noted that some of the variables had p-values above 0.10, which meant they were not statistically significant at our chosen level. We gradually removed these variables from the analysis, re-running the regression each time, until every remaining variable had a p-value below 0.10, our chosen significance level for this project.

We then took a look at the final parameters. Land area had a positive coefficient, meaning that larger neighborhoods tended to have more gyms. Each of the individual age groups also had a positive coefficient, meaning that a larger population in any of our age groups was associated with a greater number of gyms. Surprisingly, the employment rate and the average after-tax income both had negative coefficients, which meant that as employment and income increased, the predicted number of gyms actually decreased. One possible explanation is that people with higher incomes buy at-home exercise equipment instead of gym memberships.

We also noted that the model achieved an R-squared value of 0.639, meaning it explained roughly 64% of the variation in the number of gyms across neighborhoods. That is impressive considering that only nine features were included in the analysis.

After that, we went ahead and used the predict function from the statsmodels library to predict how many gyms each neighborhood would have. We then created bar charts of the actual number of gyms, the predicted number of gyms, and the predicted value minus the actual value.

We concluded that our model was capable of producing a fairly accurate estimate of how many gyms could be supported by the individuals of each neighborhood. With the assumption that the model predictions were fairly accurate, we decided to compare the actual number of gyms to the predicted number of gyms. The idea here was that if the actual number of gyms was lower than the predicted number of gyms or 'potential demand for gyms' for a given neighborhood, then that neighborhood would be an excellent place to open a new gym.

With that, we concluded that the neighborhoods with the highest values of the feature 'Predicted Value minus Actual Value' would be the best locations to open a new gym.

## 4. Results:

This section contains various images of dataframes and charts that were created throughout the analysis:

### This is our dataframe after cleaning the Toronto dataset and selecting our features:

![Dataframe%20after%20cleaning.JPG](attachment:Dataframe%20after%20cleaning.JPG)

### Here is the map of our Toronto neighborhoods after obtaining their latitude and longitude values:

![Map%20of%20all%20neighborhoods.JPG](attachment:Map%20of%20all%20neighborhoods.JPG)

### This is the output generated from our regression analysis:

![Regression%20Results.JPG](attachment:Regression%20Results.JPG)

### This is the dataframe comparing the actual number of gyms, the predicted number of gyms, and the difference between them:

![Actual,%20Predicted%20Dataframe.JPG](attachment:Actual,%20Predicted%20Dataframe.JPG)

### This is the chart of the Predicted Values minus the Actual Values by Neighborhood:

![Predicted%20minus%20Actual%20Chart.JPG](attachment:Predicted%20minus%20Actual%20Chart.JPG)

## 5. Recommendations

In terms of the 'Predicted - Actual' feature, the following neighborhoods are the best locations to open up a new gym:

1. Rouge
2. Humber Summit
3. York University Heights
4. Forest Hill South
5. Little Portugal
6. South Parkdale
7. Annex
8. High Park North
9. Henry Farm
10. Mount Pleasant West

Our suggestion for Jimmy is to begin at the top of this list with Rouge, which currently has a 'deficit' of ~13 gyms, and see if he can find a place to open his new gym there. If he is unable to find a suitable place, we suggest that he move down the list to the next neighborhood until he finds a location that he is comfortable with.

Below are the neighborhoods with the lowest (most negative) 'Predicted - Actual' values. According to our model, these neighborhoods already have more gyms than their area, population, and income would support. It is possible that some of these gyms may go out of business in the near future due to the 'excess' number of gyms. Based on this analysis, Jimmy should avoid opening a gym in any of the following neighborhoods:

1. Yonge-Eglinton
2. Islington-City Centre West
3. Hillcrest Village
4. Willowdale East
5. Lansing-Westgate
6. New Toronto
7. Mount Pleasant East
8. Willowdale West
9. South Riverdale
10. St.Andrew-Windfields

## 6. Conclusion

It's time to conclude our report. Let's do a quick overview of what we've done so far.

#### Introduction:

In the introduction we introduced the main reason for our analysis. Jimmy wanted to open up a new gym and it was our job to find the best location for him.

#### Data Acquisition and Cleaning:

In this section we acquired our data before performing extensive cleaning on it to take care of missing values, ensure that the datatypes were correct, and select the necessary features.

#### Exploratory Data Analysis:

In this section of our report we performed a descriptive statistics analysis, a correlation analysis, and a multiple linear regression analysis. We explained that we were looking for neighborhoods where the predicted values were greater than the actual values.

#### Results:

In this section we posted the results of our analysis. More specifically, we pasted a collection of dataframes and charts that represented the analysis that we had performed.

#### Recommendations:

In this section we recommended to Jimmy the neighborhoods that scored the best according to our model, that is, the neighborhoods with the highest 'deficit' of gyms. We also recommended that Jimmy stay away from those locations that received poor scores, that is, those neighborhoods that have the highest 'surplus' of gyms.

#### Conclusion:

That's it. We hope that Jimmy is able to use our analysis to help him decide on the best location in Toronto to open up his new gym. Starting a new business is always a risky endeavor and the more information Jimmy has, the better off and more prepared he will be.