# Data

    The initial data needed for the analysis will be obtained from Wikipedia.
    The Wikipedia page https://en.wikipedia.org/wiki/BH_postcode_area
    lists the Boroughs and Neighborhoods in the Bournemouth area,
    together with their postal codes.
    
    We will use this data to answer the following questions:
        1. Is the city a decent size, and how many different Neighborhoods does it have?
        2. How do the Neighborhoods compare to each other? Is there a Borough
           that has more business opportunities than the others?
        3. Finally, what business opportunities exist, and what business recommendations can we make?

## Methodology

    While there are many methods for data analysis, in this section we will use Python and
    its associated machine learning modules.
    
    We will first discuss some of the modules used in our analysis.

### Foursquare API
    Foursquare is a social location service that allows users to explore the world around them.
    Access to the Foursquare API is obtained by signing up for Developer access. The resulting API access keys
    allow application developers to interact with the Foursquare platform. The API itself
    is a RESTful set of addresses to which you can send requests and receive the output in XML
    or JSON format. This data can then be mined according to our needs.
    
    An API request looks like this:
        'https://api.foursquare.com/v2/venues/explore?client_id=<client-id>&client_secret=<client_secret>&v=<version>&ll=<latitude>,<longitude>&radius=<span-radius>&limit=10'
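A request URL of this shape can be assembled programmatically. The following is a minimal sketch using only the Python standard library. The helper name `build_explore_url`, the default radius, the version string `20180605`, and the approximate Bournemouth coordinates are illustrative assumptions, not values taken from the project.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=500, limit=10):
    """Build the explore URL for one latitude/longitude pair (hypothetical helper)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,             # API version, written as a date string
        "ll": f"{lat},{lng}",     # latitude and longitude, comma-separated
        "radius": radius,         # search radius in metres (assumed default)
        "limit": limit,           # maximum number of venues to return
    }
    # safe="," keeps the comma in the ll parameter readable
    return f"{EXPLORE_ENDPOINT}?{urlencode(params, safe=',')}"


# Placeholder credentials and approximate coordinates, for illustration only
url = build_explore_url("MY_CLIENT_ID", "MY_CLIENT_SECRET", "20180605", 50.72, -1.88)
print(url)
```

Building the query string with `urlencode` rather than by hand avoids the missing or misplaced `&` separators that are easy to introduce when concatenating strings.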

    Example of data obtained from the API (an excerpt of the response):
       'venue': {'id': '4c70e6a0d97fa1439e87f7ca',
       'name': 'Kurpark',
       'location': {'address': 'Kurpromenade',
        'lat': 48.79974097267919,
        'lng': 8.438299101627925,
        'labeledLatLngs': [{'label': 'display',
          'lat': 48.79974097267919,
          'lng': 8.438299101627925}],
        'distance': 287,
        'cc': 'DE',
        'city': 'Bad Herrenalb',
        'state': 'Baden-Württemberg',
        'country': 'Deutschland',
        'formattedAddress': ['Kurpromenade', 'Bad Herrenalb', 'Deutschland']},
       'categories': [{'id': '4bf58dd8d48988d163941735',
         'name': 'Park',
         'pluralName': 'Parks',
         'shortName': 'Park',
         'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/parks_outdoors/park_',
          'suffix': '.png'},
         'primary': True}],
       'photos': {'count': 0, 'groups': []}},
      'referralId': 'e-0-4c70e6a0d97fa1439e87f7ca-1'},
      {'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      
      Note that in the above example there is plenty of data for this specific location (Kurpark). We can mine the
      data for location, reviews, url, etc.
      
      We will use this method to get details for all Neighborhoods and extract our desired fields:
      'venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng'
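As a rough illustration of this extraction step, the sketch below pulls the four fields out of a list of response items. The function name `extract_venue_fields` is hypothetical, the sample item is a trimmed copy of the Kurpark record shown above, and taking the first category's name as the venue category is an assumption.

```python
def extract_venue_fields(items):
    """Flatten response items into rows holding only the fields of interest."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        rows.append({
            "venue.name": venue["name"],
            # Assumption: use the name of the first listed category
            "venue.categories": categories[0]["name"] if categories else None,
            "venue.location.lat": venue["location"]["lat"],
            "venue.location.lng": venue["location"]["lng"],
        })
    return rows


# Trimmed copy of the Kurpark record from the example response
sample_items = [{
    "venue": {
        "id": "4c70e6a0d97fa1439e87f7ca",
        "name": "Kurpark",
        "location": {"lat": 48.79974097267919, "lng": 8.438299101627925},
        "categories": [{"name": "Park", "primary": True}],
    },
    "referralId": "e-0-4c70e6a0d97fa1439e87f7ca-1",
}]

rows = extract_venue_fields(sample_items)
print(rows)
```

A list of flat dictionaries like this can be passed directly to a DataFrame constructor when a tabular view is needed.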

### Scikit-learn
    Now that we have the data, our next step is to analyze it. We will use Python's scikit-learn module.
    Scikit-learn provides a range of supervised and unsupervised learning algorithms via a consistent interface
    and streamlined API.
    In this section, we will use one-hot encoding. This method converts each categorical variable
    into a set of numerical 0/1 columns, one per category, so that every category is on the same scale and
    no single variable skews our results.
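To make the encoding concrete, here is a small pure-Python sketch of one-hot encoding applied to venue categories. The venue rows are invented for illustration, and averaging the encoded rows per Neighborhood is an assumption about how the data might be summarised before clustering. In practice a library routine such as scikit-learn's `OneHotEncoder` or pandas' `get_dummies` would likely do the encoding.

```python
from collections import defaultdict


def one_hot(rows):
    """Encode (neighborhood, category) rows as 0/1 vectors, one column per category."""
    categories = sorted({category for _, category in rows})
    encoded = [
        (neighborhood, [1 if category == c else 0 for c in categories])
        for neighborhood, category in rows
    ]
    return categories, encoded


def mean_by_neighborhood(encoded):
    """Average the 0/1 vectors per neighborhood (assumed summarisation step)."""
    grouped = defaultdict(list)
    for neighborhood, vector in encoded:
        grouped[neighborhood].append(vector)
    return {
        neighborhood: [sum(column) / len(vectors) for column in zip(*vectors)]
        for neighborhood, vectors in grouped.items()
    }


# Invented sample data: one row per venue
venues = [
    ("Bournemouth", "Pub"),
    ("Bournemouth", "Park"),
    ("Poole", "Pub"),
    ("Poole", "Pub"),
]

categories, encoded = one_hot(venues)
profiles = mean_by_neighborhood(encoded)
print(categories)   # column order of the encoded vectors
print(profiles)     # share of each category per neighborhood
```

Each Neighborhood ends up as a numeric vector of category shares, which is a form that a distance-based algorithm can work with.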

    Another important module is K-means clustering, one of the most popular clustering algorithms.
    The goal of this algorithm is to find groups (clusters) in the given data set.
    
    Algorithm:
    The algorithm works as follows. We have different Neighborhoods, say x_1, x_2, x_3, ..., x_n, and a value of K.
    Step 1 - Pick K random points as cluster centers, called centroids.
    Step 2 - Assign each x_i to the nearest cluster by calculating its distance to each centroid.
    Step 3 - Find the new cluster centers by taking the average of the assigned points.
    Step 4 - Repeat Steps 2 and 3 until none of the cluster assignments change.
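The four steps can be sketched directly in plain Python. This is an illustrative implementation under a few assumptions, not the scikit-learn one: the initial centroids are drawn from the data points themselves, distance is squared Euclidean, an empty cluster keeps its previous centroid, and the sample points are invented.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, seed=0, max_iter=100):
    """Cluster points into k groups; returns (assignments, centroids)."""
    rng = random.Random(seed)
    # Step 1: pick K random data points as the initial centroids (assumption)
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop once no assignment changes
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each centroid to the average of its assigned points
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignments, centroids


# Invented sample: two well-separated groups of points
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
assignments, centroids = kmeans(points, k=2)
print(assignments)
```

The `max_iter` cap is a safety limit so the loop always terminates, even if the assignments were to keep oscillating on unusual data.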
    
    This logic brings similar Neighborhoods together. We will cluster our dataset and examine the top 5 clusters.
    For example, Neighborhoods that have pubs or restaurants are grouped together (like Bournemouth and Poole).

### Plotting packages - Matplotlib and Folium
    Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. 
    It provides an object-oriented API for embedding plots into applications.
    
    Folium is a powerful data visualization library in Python that is built on top of the Leaflet.js mapping library. It helps people visualize
    geospatial data. With Folium, one can create a map of any location in the world using its latitude and longitude values.
    
    In this section we will import Matplotlib and Folium, cluster the Neighborhoods, and plot a Folium geospatial map
    to visualize the clusters of Neighborhoods. A visual representation is in most cases easier to understand.
    Example: if we want to see where on the map all German restaurants are located, this approach makes it possible.

## Methodology Details
The following is a high-level summary of the approach, divided into sections for ease of understanding:

Section-1
- We use the Python module BeautifulSoup to scrape the data from the Wikipedia page
  (https://en.wikipedia.org/wiki/BH_postcode_area). The fields we are
  interested in are PostalCode, Borough and Neighborhoods.
- Then we obtain the latitude and longitude values using a geolocator module

Section-2
- Obtain the Venue details for each of the Neighborhoods using the Foursquare API
- Next we get the top 10 Venues within the given area span


Section-3
- We can then find the top 10 Venues for each of the Neighborhoods
- Then organize our results in a DataFrame
- We then use K-means to cluster the Neighborhoods that are similar
- The clusters of Neighborhoods are then marked on the map using the Folium package
- Now we have sufficient data to examine the top 5 Neighborhood clusters
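The ranking step in Section-3 could be sketched as follows, assuming that "top Venues" means the most frequent venue categories in each Neighborhood. The helper name `top_categories` and the sample rows are invented for illustration.

```python
from collections import Counter, defaultdict


def top_categories(rows, n=10):
    """Return the n most frequent venue categories for each neighborhood."""
    counts = defaultdict(Counter)
    for neighborhood, category in rows:
        counts[neighborhood][category] += 1
    return {
        neighborhood: [category for category, _ in counter.most_common(n)]
        for neighborhood, counter in counts.items()
    }


# Invented sample data: one (neighborhood, category) row per venue
venues = [
    ("Bournemouth", "Pub"),
    ("Bournemouth", "Pub"),
    ("Bournemouth", "Park"),
    ("Poole", "Restaurant"),
]

top = top_categories(venues, n=2)
print(top)
```

The resulting dictionary has one ranked list per Neighborhood, which maps naturally onto one DataFrame row per Neighborhood with columns for the 1st, 2nd, and later most common categories.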
   