# Capstone Project - The Battle of the Neighborhoods
### Applied Data Science Capstone by IBM

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)

## Introduction: Business Problem <a name="introduction"></a>

**Inkjet printers with ink tanks** have been gaining popularity in emerging markets such as India. Their appeal is the "lifetime" ink supply that comes with the printer, which tackles the perennial issue of frequent purchases of costly ink supplies.

Epson is currently the market leader, with HP working hard to catch up and gain market share. Given limited go-to-market funds and retail costs, **how can HP compete effectively in different countries by being selective about which cities it promotes its printers in, using targeted product placement?**

This project uses data science to answer that question. The result should be a valuable tool for HP to gain market share rapidly without eroding profits unnecessarily.

**India is the chosen country**, as it is one of the main battlegrounds of the ink tank printer war.

## Data <a name="data"></a>

**Google Trends** is used, with the following parameters, to select the cities in India for analysis by amount of search interest:
* Search term: **'tank printer'**
* Time range: Jan 2016 to Jan 2021
* Due to limitations of the **pytrends API**, the data is imported for analysis from a CSV file generated by **Google Trends**

The **GeoPy API** is used to generate latitude and longitude data for India and the selected cities.

The **Foursquare API** is used to generate the top 10 most common places where printers can be displayed and purchased within a 30 km radius of each city centre.

### Create dataframe from Google Trends CSV file

Using the keywords "tank printer", the cities with search interest are listed and exported as a CSV file.

Unfortunately, the **pytrends API** is unable to generate data by city for India, so it is not used in this analysis.

In [4]:
#Google trends on keyword "tank printer" for India by city for last 5 years
import types
import pandas as pd
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage.
# The API key is redacted here; never share a notebook that contains real credentials.
client_e8b4efdc254b4e3cba32290b0c8527d1 = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='<YOUR_IBM_API_KEY>',  # placeholder, supply your own key
    ibm_auth_endpoint="https://iam.cloud.ibm.com/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3.ap-geo.objectstorage.service.networklayer.com')

body = client_e8b4efdc254b4e3cba32290b0c8527d1.get_object(Bucket='pythonbasicsfordatascienceproject-donotdelete-pr-8x8em0stoa3wot',Key='India_search_for_ tank_printer_Jan_2016_to_Jan_2021_from_Google_1Jan21.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df_gtrend = pd.read_csv(body)
df_gtrend.head()

| | City | trend_count |
|---|---|---|
| 0 | Kolkata | 100 |
| 1 | Ghaziabad | 94 |
| 2 | Noida | 92 |
| 3 | Mumbai | 91 |
| 4 | Gurgaon | 84 |
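
For readers running the notebook outside IBM Watson Studio, the same dataframe can be built from a local copy of the export. The following is a minimal sketch, assuming the CSV has the index, `City`, `trend_count` layout shown in the output above. The inline sample stands in for the file, and `load_trend_csv` is a hypothetical helper that is not part of the original notebook.

```python
import io

import pandas as pd

# Inline sample standing in for the exported file
# (values copied from the output above).
SAMPLE_CSV = """,City,trend_count
0,Kolkata,100
1,Ghaziabad,94
2,Noida,92
3,Mumbai,91
4,Gurgaon,84
"""


def load_trend_csv(source):
    """Read a trend CSV and return it sorted by descending search interest."""
    # The first, unnamed column holds the saved index, so use it as such.
    df = pd.read_csv(source, index_col=0)
    # Coerce non-numeric scores to NaN and drop them.
    df["trend_count"] = pd.to_numeric(df["trend_count"], errors="coerce")
    df = df.dropna(subset=["trend_count"])
    return df.sort_values("trend_count", ascending=False).reset_index(drop=True)


# A file path can be passed in place of the StringIO object.
df_sample = load_trend_csv(io.StringIO(SAMPLE_CSV))
print(df_sample)
```

Passing `index_col=0` avoids carrying the saved index around as an extra `Unnamed: 0` column.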


### Generate latitude and longitude data for the listed cities using GeoPy

In [5]:
!pip install geopy # comment this line out if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
from geopy.extra.rate_limiter import RateLimiter



In [6]:
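# Nominatim's usage policy asks for at most one request per second,
# so wrap geocode in the RateLimiter imported above
geolocator = Nominatim(user_agent="ny-application")
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)
df_gtrend['gcode'] = df_gtrend['City'].apply(geocode)
df_gtrend['lat'] = [g.latitude for g in df_gtrend['gcode']]
df_gtrend['long'] = [g.longitude for g in df_gtrend['gcode']]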
df_gtrend

| | City | trend_count | gcode | lat | long |
|---|---|---|---|---|---|
| 0 | Kolkata | 100 | (Kolkata, Howrah, West Bengal, India, (22.5414... | 22.541418 | 88.357691 |
| 1 | Ghaziabad | 94 | (Ghaziabad, Uttar Pradesh, India, (28.711241, ... | 28.711241 | 77.444537 |
| 2 | Noida | 92 | (Noida, Dadri, Gautam Buddha Nagar, Uttar Prad... | 28.535633 | 77.391073 |
| 3 | Mumbai | 91 | (Mumbai, Mumbai Suburban, Maharashtra, India, ... | 19.07599 | 72.877393 |
| 4 | Gurgaon | 84 | (Gurgaon, Gurugram, Haryana, India, (28.428262... | 28.428262 | 77.0027 |
| 5 | New Delhi | 62 | (New Delhi, Delhi, India, (28.6138954, 77.2090... | 28.613895 | 77.209006 |
| 6 | Coimbatore | 57 | (Coimbatore, Coimbatore North, Coimbatore Dist... | 11.001812 | 76.962842 |
| 7 | Bengaluru | 53 | (Bengaluru, Bangalore North, Bangalore Urban, ... | 12.97912 | 77.5913 |
| 8 | Kochi | 47 | (Kochi, Ernakulam district, Kerala, 682005, In... | 9.93137 | 76.267376 |
| 9 | Chennai | 46 | (Chennai, Chennai District, Tamil Nadu, India,... | 13.083694 | 80.270186 |
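
A geocoder can return `None` for a name it cannot resolve, which would make the list comprehensions above fail. The sketch below shows one way to guard against that, assuming only that the geocoder is a callable returning an object with `latitude` and `longitude` attributes (as a geopy `Location` does). `add_coordinates`, `FakeLocation`, and `stub_geocode` are hypothetical names introduced so the example runs offline.

```python
import math
from collections import namedtuple

import pandas as pd

# Stand-in for a geopy Location; only the two attributes used here.
FakeLocation = namedtuple("FakeLocation", ["latitude", "longitude"])


def add_coordinates(df, geocode, city_col="City", country="India"):
    """Return a copy of df with 'lat' and 'long' columns (NaN when geocoding fails)."""
    out = df.copy()
    lats, lngs = [], []
    for city in out[city_col]:
        # Appending the country name reduces the chance of matching a
        # same-named place elsewhere.
        location = geocode("{}, {}".format(city, country))
        lats.append(location.latitude if location is not None else math.nan)
        lngs.append(location.longitude if location is not None else math.nan)
    out["lat"] = lats
    out["long"] = lngs
    return out


# Offline stub (hypothetical) so the sketch runs without network access.
_KNOWN = {"Kolkata, India": FakeLocation(22.541418, 88.357691)}


def stub_geocode(query):
    return _KNOWN.get(query)


cities = pd.DataFrame({"City": ["Kolkata", "Nowhereville"]})
result = add_coordinates(cities, stub_geocode)
print(result)
```

With geopy, a rate-limited callable such as `RateLimiter(geolocator.geocode, min_delay_seconds=1)` could be passed as `geocode` in place of the stub.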


### Use Folium to visualize the chosen Indian cities with the search trend score shown in a popup

Install and import all the required dependencies.

In [7]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # conda alternative to the pip install below
!pip install geopy # comment this line out if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON into a pandas dataframe (pandas >= 1.0; the pandas.io.json import path is deprecated)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans
from sklearn import metrics
from scipy.spatial.distance import cdist
import matplotlib.pyplot as plt

#!conda install -c conda-forge folium=0.5.0 --yes # conda alternative to the pip install below
!pip install folium # comment this line out if folium is already installed
import folium # map rendering library

print('Libraries imported.')

Collecting folium
  Downloading folium-0.11.0-py2.py3-none-any.whl (93 kB)
Collecting branca>=0.3.0
  Downloading branca-0.4.2-py3-none-any.whl (24 kB)
Installing collected packages: branca, folium
Successfully installed branca-0.4.2 folium-0.11.0
Libraries imported.


Get the geographical coordinates of India.

In [8]:
address = 'India'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of India are {}, {}.'.format(latitude, longitude))

The geographical coordinates of India are 22.3511148, 78.6677428.


Visualize India and the selected cities from Google Trends.

In [80]:
!pip install folium
import folium

# Create India map
map_india = folium.Map(location=[latitude, longitude], zoom_start=5)

# Add markers
for lat, lng, city, trend_count in zip(df_gtrend['lat'], df_gtrend['long'], df_gtrend['City'], df_gtrend['trend_count']):
    label = '{} {}'.format(city, trend_count)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=7,
        popup=label,
        color='green',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_india)  
    
map_india

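The map above draws every city with the same marker size, so the search interest is visible only in the popup. As one possible variation, the sketch below scales the marker radius with `trend_count`. `trend_to_radius` and `build_scaled_map` are hypothetical helpers, and the 4 to 14 pixel range is an arbitrary assumption.

```python
def trend_to_radius(trend_count, min_radius=4.0, max_radius=14.0, max_trend=100.0):
    """Linearly map a 0-100 Google Trends score to a marker radius in pixels."""
    clipped = max(0.0, min(float(trend_count), max_trend))
    return min_radius + (max_radius - min_radius) * clipped / max_trend


def build_scaled_map(df, center, zoom_start=5):
    """Build a folium map whose circle markers grow with search interest."""
    import folium  # imported lazily so trend_to_radius works without folium

    fmap = folium.Map(location=list(center), zoom_start=zoom_start)
    for lat, lng, city, trend in zip(df["lat"], df["long"], df["City"], df["trend_count"]):
        folium.CircleMarker(
            [lat, lng],
            radius=trend_to_radius(trend),
            popup=folium.Popup("{} {}".format(city, trend), parse_html=True),
            color="green",
            fill=True,
            fill_color="#3186cc",
            fill_opacity=0.7,
        ).add_to(fmap)
    return fmap


# Radii for the highest and lowest scores in the city table.
print(trend_to_radius(100), trend_to_radius(46))
```

In the notebook this could be called as `build_scaled_map(df_gtrend, (latitude, longitude))`.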


### Generate common places where printers can be displayed and purchased in a 30 km radius around the city centre using Foursquare

The following Foursquare categories are chosen to narrow down the data generated:
* Shopping Mall, 4bf58dd8d48988d1fd941735
* Shopping Plaza, 5744ccdfe4b0c0459246b4dc
* Outlet Mall, 5744ccdfe4b0c0459246b4df
* Outlet Store, 52f2ab2ebcbc57f1066b8b35
* Paper/Office Supplies Store, 4bf58dd8d48988d121951735
* Electronics Store, 4bf58dd8d48988d122951735
* Department Store, 4bf58dd8d48988d1f6941735
* Bookstore, 4bf58dd8d48988d114951735

Define Foursquare Credentials and Version.

In [11]:
CLIENT_ID = '<YOUR_CLIENT_ID>' # your Foursquare ID (redacted)
CLIENT_SECRET = '<YOUR_CLIENT_SECRET>' # your Foursquare Secret (redacted)
ACCESS_TOKEN = '<YOUR_ACCESS_TOKEN>' # your Foursquare Access Token (redacted)
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: <YOUR_CLIENT_ID>
CLIENT_SECRET: <YOUR_CLIENT_SECRET>


Explore the first city in dataframe as an example.

Get the city's name and geographical location.

In [12]:
city_latitude = df_gtrend.loc[0, 'lat'] # city latitude value
city_longitude = df_gtrend.loc[0, 'long'] # city longitude value

city_name = df_gtrend.loc[0, 'City'] # city name

print('Latitude and longitude values of {} are {}, {}.'.format(city_name, 
                                                               city_latitude, 
                                                               city_longitude))

Latitude and longitude values of Kolkata are 22.5414185, 88.35769124388872.


Now, let's get the top 100 venues within a radius of 30 km.

First, let's create the GET request URL.

In [13]:
# Form Foursquare URL with categories where printers can likely be purchased
LIMIT = 100

radius = 30000

printer_categories = '4bf58dd8d48988d1fd941735,5744ccdfe4b0c0459246b4dc,5744ccdfe4b0c0459246b4df,52f2ab2ebcbc57f1066b8b35,4bf58dd8d48988d121951735,4bf58dd8d48988d122951735,4bf58dd8d48988d1f6941735,4bf58dd8d48988d114951735'

url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&oauth_token={}&radius={}&limit={}'.format(
       CLIENT_ID,
       CLIENT_SECRET,
       VERSION,
       city_latitude,
       city_longitude,
       printer_categories,
       ACCESS_TOKEN,
       radius,
       LIMIT)

url # note: displaying the URL exposes the credentials it contains

'https://api.foursquare.com/v2/venues/explore?client_id=<redacted>&client_secret=<redacted>&v=20180605&ll=22.5414185,88.35769124388872&categoryId=4bf58dd8d48988d1fd941735,5744ccdfe4b0c0459246b4dc,5744ccdfe4b0c0459246b4df,52f2ab2ebcbc57f1066b8b35,4bf58dd8d48988d121951735,4bf58dd8d48988d122951735,4bf58dd8d48988d1f6941735,4bf58dd8d48988d114951735&oauth_token=<redacted>&radius=30000&limit=100'
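
Building the query string from a dictionary and reading the credentials from the environment keeps secrets out of the notebook and avoids hand-placed `&` separators. The following is a minimal sketch using only the standard library. The environment variable names and the `build_explore_url` helper are assumptions for illustration, and the `oauth_token` parameter from the cell above is left out for brevity.

```python
import os
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, category_ids, radius=30000, limit=100, version="20180605"):
    """Assemble the explore URL from named parameters."""
    params = {
        # Hypothetical variable names; placeholders are used when they are unset.
        "client_id": os.environ.get("FOURSQUARE_CLIENT_ID", "YOUR_CLIENT_ID"),
        "client_secret": os.environ.get("FOURSQUARE_CLIENT_SECRET", "YOUR_CLIENT_SECRET"),
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "categoryId": ",".join(category_ids),
        "radius": radius,
        "limit": limit,
    }
    # safe="," keeps the comma-separated values readable instead of %2C.
    return EXPLORE_ENDPOINT + "?" + urlencode(params, safe=",")


example_url = build_explore_url(
    22.5414185,
    88.357691,
    ["4bf58dd8d48988d1fd941735", "4bf58dd8d48988d122951735"],
)
# Print only the part after the credentials.
print(example_url.split("&v=")[1])
```

The same dictionary could instead be passed to `requests.get(EXPLORE_ENDPOINT, params=params)`, which performs the encoding itself.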

Send the GET request and examine the results.

In [14]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5ff08032dad389062d7cd86b'},
 'notifications': [{'type': 'notificationTray', 'item': {'unreadCount': 0}}],
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Kolkata',
  'headerFullLocation': 'Kolkata',
  'headerLocationGranularity': 'city',
  'query': 'mall',
  'totalResults': 43,
  'suggestedBounds': {'ne': {'lat': 22.81141877000027,
    'lng': 88.64947921959752},
   'sw': {'lat': 22.27141822999973, 'lng': 88.06590326817992}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '5274b7f111d2b513631071a5',
       'name': 'Quest Mall',
       'location': {'address': '33, Syed Aamir Ali Ave',
        'lat': 22.539068009925764,
        'lng': 88.3655245668

All the information is in the items key. Use the get_category_type function from the Foursquare lab.

In [21]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Clean the json and structure it into a pandas dataframe.

In [22]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()



| | name | categories | lat | lng |
|---|---|---|---|---|
| 0 | Quest Mall | Shopping Mall | 22.539068 | 88.365525 |
| 1 | South City Mall | Shopping Mall | 22.501758 | 88.361726 |
| 2 | Acropolis Mall | Shopping Mall | 22.514823 | 88.393235 |
| 3 | City Centre | Shopping Mall | 22.587921 | 88.408098 |
| 4 | Mani Square Mall | Shopping Mall | 22.577823 | 88.400591 |
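
The flattening steps above can be wrapped in a single function so they can be repeated for each city, followed by a count of the most common categories. This is a sketch under stated assumptions: `venues_to_frame` and `top_categories` are hypothetical helpers, the sample items mimic only the fields used here, and "Example Electronics" is an invented venue included to show a second category.

```python
import pandas as pd


def venues_to_frame(items):
    """Flatten Foursquare 'items' into name, categories, lat and lng columns."""
    flat = pd.json_normalize(items)
    cols = ["venue.name", "venue.categories", "venue.location.lat", "venue.location.lng"]
    flat = flat.loc[:, cols].copy()
    # Keep only the name of the first category, or None if the list is empty.
    flat["venue.categories"] = flat["venue.categories"].apply(
        lambda cats: cats[0]["name"] if cats else None
    )
    # Strip the 'venue.' and 'venue.location.' prefixes.
    flat.columns = [col.split(".")[-1] for col in flat.columns]
    return flat


def top_categories(frame, n=10):
    """Return the n most frequent venue categories with their counts."""
    return frame["categories"].value_counts().head(n)


sample_items = [
    {"venue": {"name": "Quest Mall",
               "categories": [{"name": "Shopping Mall"}],
               "location": {"lat": 22.539068, "lng": 88.365525}}},
    {"venue": {"name": "South City Mall",
               "categories": [{"name": "Shopping Mall"}],
               "location": {"lat": 22.501758, "lng": 88.361726}}},
    {"venue": {"name": "Example Electronics",  # invented venue
               "categories": [{"name": "Electronics Store"}],
               "location": {"lat": 22.55, "lng": 88.35}}},
]

frame = venues_to_frame(sample_items)
counts = top_categories(frame)
print(counts)
```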


How many venues were returned by Foursquare?

In [23]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

43 venues were returned by Foursquare.
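
As a sanity check on the 30 km radius, the great-circle distance from the city centre to each venue can be computed with the haversine formula. The sketch below uses only the standard library and the Kolkata and Quest Mall coordinates printed above. `haversine_km` is a hypothetical helper, and the Earth radius of 6371 km is the usual mean-radius approximation.

```python
import math


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    earth_radius_km = 6371.0  # mean Earth radius, an approximation
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


# Kolkata city centre and Quest Mall, taken from the outputs above.
distance = haversine_km(22.5414185, 88.35769124388872, 22.539068, 88.365525)
print("Quest Mall is {:.2f} km from the city centre.".format(distance))
print("Within 30 km:", distance <= 30)
```

Applied to every row of `nearby_venues`, the same function would flag any venue that falls outside the requested radius.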
