# Young Scientist in Search for the Dream City

_Anna A. Stepanova, Ph.D_

<a id="item1"></a>
## Introduction/Business Problem

Here I present my Capstone Project for the IBM Specialization.

The BioWiz University of Wonderland is organizing a job fair for BioTechMed School. In a preliminary survey of young scientists (graduate students and postdocs of the University), the Career Development Office identified a list of factors for choosing an ideal work location. By the time of the job fair, the office needs to prepare a review of the best cities for young scientists and their families. This review will become the basis for a recommendation system that ranks cities according to each respondent's requests and expectations.

Let's review the list of factors. First of all, it should be a city known for its biotech or biomedical research. All affiliates of BioWiz U follow the best trends in education and recognize that physical activity matters not only for their health but also for their creativity and work performance. Therefore, respondents generally considered dance studios, bike paths and shops, and stadiums essential. Among recreational spots, the most common choices were theaters, museums, art galleries, and nightlife spots. For most respondents, the crime rate and the availability of good preschools were crucial because they had families with kids. A group of respondents insisted on a developed public transportation system. Non-essential yet popular requests were outdoor recreation areas, scenic lookouts, trails, and spiritual centers. Respondents with certain health conditions were opposed to living in cities with hot and humid climates.

During the job fair, BioWiz U will present the project, identify the critical factors for choosing a dream city, and suggest ways to apply similar recommendation systems in other departments and universities. The latter will include a discussion of surveys and of how to translate survey data into a problem that data scientists can solve.

## Table of Contents


1. [Introduction/Business Problem](#item1)
2. [Data](#item2)

    2.1. [Crime Data](#item21)




<a id="item2"></a>
## Data

The list of cities best suited for biotechnology or biomedical research in Canada and the USA will be obtained from the "Top Cities" category on https://www.glassdoor.com/ and stored as a .csv file. Using the **geopy** package, we'll find the cities' geographical coordinates for use with the Foursquare API. Using the Visual Crossing Weather API, we'll obtain historical weather data for the chosen cities and store summarized results for the past five years in a .csv file for future access. Generalized crime data (crime index and safety index) for US and Canadian cities will be obtained by scraping https://www.numbeo.com/crime/. Finally, using the Foursquare API, we'll search for specific categories of venues (Arts & Entertainment, Country Dance Club, Music Venue, Dance Studio, Stadium, Rock Climbing Spot, Nightlife Spot, Outdoors & Recreation, Bike Trail, Preschool, Spiritual Center, Travel & Transport, Scenic Lookout, Trail).
We'll search for venues within city limits.
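
The geocoding step described above could look roughly like the sketch below. It is only an illustration, not the code used later in this notebook: `geocode_cities` and `make_nominatim_geocoder` are hypothetical helper names, and the `user_agent` string is a placeholder. The lookup function is passed in as an argument so the logic can be tried offline with a stub before any request is sent to the Nominatim service.

```python
def geocode_cities(cities, geocode):
    """Map each city name to a (latitude, longitude) tuple, or None if not found.

    `geocode` is any callable that takes a query string and returns an object
    with `latitude` and `longitude` attributes (or None when nothing matches),
    which is how geopy's `Nominatim.geocode` behaves.
    """
    coords = {}
    for city in cities:
        location = geocode(city)
        if location is None:
            coords[city] = None
        else:
            coords[city] = (location.latitude, location.longitude)
    return coords


def make_nominatim_geocoder(user_agent="dream_city_explorer"):
    """Build a rate-limited Nominatim lookup (hypothetical helper).

    The user_agent value is a placeholder; Nominatim asks each application
    to identify itself with its own string.
    """
    from geopy.geocoders import Nominatim
    from geopy.extra.rate_limiter import RateLimiter

    geolocator = Nominatim(user_agent=user_agent)
    # wait at least one second between requests to respect the public service
    return RateLimiter(geolocator.geocode, min_delay_seconds=1)
```

With network access, `geocode_cities(["Boston, MA", "Madison, WI"], make_nominatim_geocoder())` would return a dictionary of coordinates that can be merged into the city table.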


Having gathered all the required data, we'll address the following objectives:


1) We'll cluster cities to identify overall similarities between them based on crime and weather data;

2) We'll cluster cities based on venue selections;

3) (OPTIONAL) We'll generate a fake survey dataset (set of preferences for choosing a dream city) and try to build a recommendation system allowing us to match a particular respondent with the list of best-matched cities.
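
To make the first objective more concrete, here is a minimal sketch of clustering cities on standardized features with k-means. The city names, the `Mean Humidity` column, and every number in it are placeholders chosen for illustration, not data collected in this project, and the choice of two clusters is likewise an assumption. Standardizing first keeps a feature with a larger numeric range from dominating the distance calculation.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Placeholder values for illustration only; these are not real measurements.
features = pd.DataFrame(
    {
        "Crime Index": [60.0, 58.0, 30.0, 32.0],
        "Mean Humidity": [80.0, 78.0, 50.0, 52.0],  # hypothetical weather feature
    },
    index=["City A", "City B", "City C", "City D"],
)

# put both features on the same scale before measuring distances
scaled = StandardScaler().fit_transform(features)

# n_clusters=2 is an arbitrary choice for this toy example
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)

clustered = features.assign(Cluster=kmeans.labels_)
print(clustered)
```

On real data, the number of clusters would need to be chosen with a diagnostic such as the elbow method rather than fixed in advance.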

***
Let's install and import the required packages before we explore the data.

In [1]:
import numpy as np # library to handle data in a vectorized manner

#!pip install --user pandas==1.0.3

import pandas as pd # library for data analysis


pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

# install geopy; comment this line out if it is already installed
!pip install geopy
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)



# import k-means from clustering stage
from sklearn.cluster import KMeans

# install folium; comment this line out if it is already installed
!pip install folium
import folium # map rendering library

from bs4 import BeautifulSoup # web scraping library

#import plotting libraries
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

import seaborn as sns

print('Libraries imported.')

Libraries imported.


<a id="item21"></a>
### 2.1. Crime Data

In this section, we'll scrape https://www.numbeo.com/crime/ to obtain crime data for the USA and Canada. For each country, the site provides a table listing cities with their crime index and safety index.
We'll use the _BeautifulSoup_ library to extract the table from the web page.
***
Let's write a function that takes a URL as an argument and returns a data frame containing City, Crime Index, and Safety Index columns:

In [3]:
def crime_table(url):
    """Extract a crime table from https://www.numbeo.com/crime/ for a specific country.

    Arguments: url -- address of the country-specific results page
    Returns: a data frame containing City, Crime Index, and Safety Index columns
    """
    # import the library we use to open URLs
    import urllib.request

    page = urllib.request.urlopen(url)

    # parse the HTML from our URL into the BeautifulSoup parse tree format
    soup = BeautifulSoup(page, "lxml")

    # find the table with cities and crime/safety indices
    table = soup.find('table', class_="stripe row-border order-column compact")
    if table is None:
        raise ValueError("Crime table not found; the page layout may have changed.")

    # collect the column data row by row
    cities = []
    crime_index = []
    safety_index = []

    for row in table.find_all('tr'):
        cells = row.find_all('td')
        # data rows have four cells; the first one is skipped
        if len(cells) == 4:
            cities.append(cells[1].get_text(strip=True))
            crime_index.append(cells[2].get_text(strip=True))
            safety_index.append(cells[3].get_text(strip=True))

    # create a data frame using the pandas library
    crime = pd.DataFrame({'City': cities,
                          'Crime Index': crime_index,
                          'Safety Index': safety_index})

    return crime
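
Before pointing the scraper at the live site, the row-filtering logic can be checked offline on a small HTML snippet. The sketch below is only a stand-in for the real page: the markup in `SAMPLE_HTML`, including the header labels, is an assumption about its layout, and `parse_rows` is a hypothetical helper that mirrors the loop in `crime_table`. It uses Python's built-in `html.parser` so that `lxml` is not needed for the check.

```python
from bs4 import BeautifulSoup

# Assumed markup, modelled on the table class used in crime_table();
# the real page may differ.
SAMPLE_HTML = """
<table class="stripe row-border order-column compact">
  <tr><th>Rank</th><th>City</th><th>Crime Index</th><th>Safety Index</th></tr>
  <tr><td>1</td><td><a href="#">Surrey</a></td><td>61.3</td><td>38.7</td></tr>
  <tr><td>2</td><td><a href="#">Red Deer</a></td><td>60.23</td><td>39.77</td></tr>
</table>
"""


def parse_rows(html):
    """Return (city, crime index, safety index) tuples from four-cell rows."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="stripe row-border order-column compact")
    rows = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        # the header row has <th> cells only, so it yields no <td> and is skipped
        if len(cells) == 4:
            rows.append(tuple(cell.get_text(strip=True) for cell in cells[1:]))
    return rows


rows = parse_rows(SAMPLE_HTML)
print(rows)
```

Note that every value comes back as a string, so the index columns still need a numeric conversion before any arithmetic or clustering.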
    
    
    

Now we can set URLs for Canada and the United States to get desired data sets.

In [4]:
# Canada Crime Data URL
# specify the URL of the web page we are going to scrape
ca_url = "https://www.numbeo.com/crime/country_result.jsp?country=Canada"

# US Crime Data URL
# specify the URL of the web page we are going to scrape
us_url = "https://www.numbeo.com/crime/country_result.jsp?country=United+States"
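
If more countries were added later, the URLs could be built from the country name instead of typed by hand. This is a small sketch using only the standard library; `country_url` is a hypothetical helper, not part of the notebook. `urlencode` replaces spaces with `+`, which matches the form of the United States URL above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.numbeo.com/crime/country_result.jsp"


def country_url(country):
    """Build the results-page URL for one country (hypothetical helper)."""
    # urlencode escapes the query value, turning spaces into '+'
    return f"{BASE_URL}?{urlencode({'country': country})}"


print(country_url("Canada"))
print(country_url("United States"))
```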

Applying the _crime_table_ function, we can retrieve crime data from the web. After creating separate data frames, let's combine them and check the dimensions of the new data frame as well as its first and last 10 entries.

In [8]:
# Canada Crime Data 
ca_crime = crime_table(ca_url)
ca_crime['Country'] = "Canada"

# US Crime Data
us_crime = crime_table(us_url)
us_crime['Country'] = "USA"


# Combine data frames into a single table
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
crime_df = pd.concat([ca_crime, us_crime], ignore_index=True)

print(crime_df.shape)
crime_df.head(10)

(85, 4)


| | City | Crime Index | Safety Index | Country |
|---|---|---|---|---|
| 0 | Surrey | 61.3 | 38.7 | Canada |
| 1 | Red Deer | 60.23 | 39.77 | Canada |
| 2 | Winnipeg | 57.2 | 42.8 | Canada |
| 3 | Regina | 56.12 | 43.88 | Canada |
| 4 | Brampton | 55.61 | 44.39 | Canada |
| 5 | Kelowna | 50.21 | 49.79 | Canada |
| 6 | Oshawa | 50.03 | 49.97 | Canada |
| 7 | Hamilton | 49.83 | 50.17 | Canada |
| 8 | Saskatoon | 49.4 | 50.6 | Canada |
| 9 | Nanaimo, BC | 46.45 | 53.55 | Canada |


In [9]:
crime_df.tail(10)

| | City | Crime Index | Safety Index | Country |
|---|---|---|---|---|
| 75 | Boise, ID | 37.46 | 62.54 | USA |
| 76 | Brooklyn, NY | 37.44 | 62.56 | USA |
| 77 | San Diego, CA | 36.4 | 63.6 | USA |
| 78 | Boston, MA | 35.52 | 64.48 | USA |
| 79 | Austin, TX | 34.12 | 65.88 | USA |
| 80 | Raleigh, NC | 33.83 | 66.17 | USA |
| 81 | Salt Lake City, UT | 31.76 | 68.24 | USA |
| 82 | El Paso, TX | 31.13 | 68.87 | USA |
| 83 | Madison, WI | 31.11 | 68.89 | USA |
| 84 | Irvine, CA | 19.12 | 80.88 | USA |
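
Because the scraper collects cell text, the two index columns arrive as strings and need converting before they can be used numerically. The sketch below shows one way to do that on three rows copied from the output above; the `sample` frame stands in for `crime_df`. It also checks an observation from the displayed rows: in each of them the crime index and the safety index sum to 100, so the two columns carry the same information, which is worth keeping in mind when choosing clustering features.

```python
import pandas as pd

# Three rows copied from the output above, with indices as strings
# (the form in which they are scraped).
sample = pd.DataFrame(
    {
        "City": ["Surrey", "Red Deer", "Irvine, CA"],
        "Crime Index": ["61.3", "60.23", "19.12"],
        "Safety Index": ["38.7", "39.77", "80.88"],
        "Country": ["Canada", "Canada", "USA"],
    }
)

index_cols = ["Crime Index", "Safety Index"]

# errors="coerce" turns any unparseable cell into NaN instead of raising
sample[index_cols] = sample[index_cols].apply(pd.to_numeric, errors="coerce")

# in the displayed rows the two indices are complementary
totals = sample["Crime Index"] + sample["Safety Index"]
print(sample.dtypes)
print(totals)
```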
