# Capstone Project - The Battle of the Neighborhoods (Week 2)
## Applied Data Science Capstone by IBM/Coursera

## Table of contents
#### 1. Introduction: Business Problem
#### 2. Data
#### 3. Methodology & Analysis
#### 4. Results and Discussion
#### 5. Conclusion

## Introduction: Business Problem

In this project we aim to identify the popular libraries in the state of North Carolina. North Carolina is known for the easy access its public libraries offer to general users. The idea is to identify the popular libraries in the state and, based on the venues close by, the reasons for their popularity. We will also identify the different venues close to each library, which will help the team decide on the best criteria for opening a new library in a new location.

# Data 

We will follow the steps of the data science methodology. Now that the business requirement is laid out, it is time to decide on the approach we will take to collect, understand, analyse and prepare the data. The first step is collecting the data: a publicly available library dataset, the Foursquare API for location data around the libraries, and a publicly available popular-books dataset such as the New York best sellers list. During preprocessing we will strip out any PII and keep only the publicly available data. We then analyse the data to determine which fields/columns are relevant and remove the unwanted ones. We will convert the text column to integer columns by turning its values into columns, so that we can fit a model on the data. Throughout the analysis we will use different plots to understand the data in depth.


## Dataset

 1.  Libraries in North Carolina 
 2.  Popular books
 3.  Foursquare API

In [1]:
# install all the dependencies

!pip install folium



In [2]:
# import all the necessary libraries

import types
import pandas as pd
from botocore.client import Config
import ibm_boto3
from geopy.geocoders import Nominatim
import requests
import numpy as np
from sklearn import preprocessing
%matplotlib inline 
import matplotlib.pyplot as plt
import folium # map rendering library

In [3]:
# The code was removed by Watson Studio for sharing.

Unnamed: 0,name,street,city,state,postalcode,LibraryId,isbn,title,publisher,format,type,patronid
0,Mooresville Public Library,7111 NC State University,RALEIGH,NC,27695,714,9781400123353,"Saxons, Vikings, and Celts","Tantor Media, Inc",eAudio,wishlist.post,530185
1,Cabarrus County Charles A. Cannon Library,27 Union St N,CONCORD,NC,28025,559,9781449884314,Not Without You,"Recorded Books, Inc.",eAudio,checkout.post,980440
2,Cabarrus County Charles A. Cannon Library,27 Union St N,CONCORD,NC,28025,559,9780804163972,Command Authority,Books on Tape,eAudio,checkout.post,761768
3,Cumberland County Public Library,300 Maiden Ln,FAYETTEVILLE,NC,28301,595,9781449890339,Chocolate Chip Cookie Murder,"Recorded Books, Inc.",eAudio,checkout.post,999141
4,Wilson County Public,7111 NC State University,RALEIGH,NC,27695,555,9781607880400,4th of July,Hachette Audio,eAudio,checkout.post,762054


## Methodology and Analysis


In this project we will direct our efforts at data collection, data preparation, building a model, evaluating the model and finally plotting the data on a map to identify the popular libraries.

First step - We identify the unique postal codes of all the libraries and then use the "geolocator" library to get location information (latitude and longitude) for each of them. This is needed later to plot the libraries on the map. We do this in a separate data frame to avoid repeated calls to the geolocator API, thereby avoiding time-out issues. Next we merge the location data into the original data frame. We keep only the relevant columns (name, postal code, longitude and latitude) and ignore all other columns.

Second step - Once the initial dataset is complete, we have rough information about popular titles based on the number of checkouts at each library. We use that information, with the help of the dataframe.describe method, to find the most popular library. Next we use the Foursquare API to find the venues close to the popular libraries, to see which venues help indicate whether a library is popular or not.

Third step - We create a model to identify popular libraries. Since we need to decide whether a library is popular or not, this is a binary classification problem (yes/no), so we use logistic regression to build the model and evaluate it on the dataset. We use several evaluation techniques, such as the log loss function and the confusion matrix, to verify the results.

Fourth and final step - We plot all the popular libraries in the state of North Carolina on a map and use markers to label each library with its number of checkouts.



In [4]:
# get the unique postal codes of all the libraries

postalcode = df.postalcode.unique()
df_postalcode = pd.DataFrame({'postalcode': postalcode})
df_postalcode.head()

Unnamed: 0,postalcode
0,27695
1,28025
2,28301
3,28560
4,27203


In [None]:
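# use geolocator to find the latitude and longitude
# (done once per unique postal code to limit calls to the geocoding service)

geolocator = Nominatim(user_agent="geoapiExercises")
for i, row in df_postalcode.iterrows():
    # a bare postal code can match places in other countries;
    # passing country_codes='us' to geocode is one way to restrict the search
    location = geolocator.geocode(str(row['postalcode']))
    if location is None:
        continue  # nothing found: long/lat stay NaN and are dropped later
    df_postalcode.at[i, 'long'] = location.longitude
    df_postalcode.at[i, 'lat'] = location.latitude

df_postalcode.head()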

Unnamed: 0,postalcode,long,lat
0,27695,-7.1967,42.884874
1,28025,-3.734437,40.384181
2,28301,-78.89644,35.098176
3,28560,-3.235347,40.2546
4,27203,14.140361,50.15685
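
Several coordinates in the output above lie outside North Carolina (for example longitude -7.1967 with latitude 42.884874), because a bare postal code such as 27695 can also match places in other countries. One way to constrain the lookup is sketched below. It assumes geopy's `Nominatim.geocode`, which accepts a structured query dictionary and a `country_codes` argument. The helper name `geocode_us_postalcode` and the stand-in `fake_geocode` used for the offline demonstration are hypothetical and not part of the original notebook.

```python
from collections import namedtuple


def geocode_us_postalcode(geocode, postalcode):
    """Look up one postal code, restricted to North Carolina, United States.

    `geocode` is any callable with the interface of geopy's
    Nominatim.geocode (hypothetical helper, for illustration only).
    Returns (latitude, longitude), or (None, None) when nothing is found.
    """
    query = {
        "postalcode": str(postalcode),
        "state": "North Carolina",
        "country": "United States",
    }
    location = geocode(query, country_codes="us")
    if location is None:
        return None, None
    return location.latitude, location.longitude


# Offline stand-in for the real geocoder so the sketch runs without network access.
# The coordinates for 28301 are copied from the output above.
Location = namedtuple("Location", ["latitude", "longitude"])


def fake_geocode(query, country_codes=None):
    known = {"28301": Location(35.098176, -78.89644)}
    return known.get(query["postalcode"])


print(geocode_us_postalcode(fake_geocode, 28301))
print(geocode_us_postalcode(fake_geocode, 99999))
```

With the real service, `geolocator.geocode` would be passed in place of `fake_geocode`; unmatched postal codes then come back as `(None, None)` instead of raising an error.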


In [None]:
# keep only the 'checkout.post' events and count them per library name and postal code

df_filtered = df[['name','type','postalcode']]
df_filtered = df_filtered[df_filtered["type"] == 'checkout.post']
df_filtered = df_filtered.rename(columns={"type": "checkoutcount"})
df_filtered = df_filtered.groupby(['name','postalcode'])["checkoutcount"].count().reset_index()
df_filtered.head()

Unnamed: 0,name,postalcode,checkoutcount
0,Alamance Community College,27695,7
1,Alamance County Public Libraries,27695,5905
2,Albemarle Regional,27695,42
3,Alexander County,28681,635
4,Appalachian Regional,27695,293


In [None]:
# merge the dataframe with latitude and longitude information

df_filtered = df_filtered.set_index('postalcode').join(df_postalcode.set_index('postalcode'))
df_filtered = df_filtered.sample(frac=1).reset_index(drop=True)
df_filtered.dropna(subset = ["long","lat"], inplace=True)
df_filtered.head()

Unnamed: 0,name,checkoutcount,long,lat
0,Carteret Community College,13,-7.1967,42.884874
1,Western Piedmont Community College,4,-7.1967,42.884874
2,Scotland County Memorial,2,-7.1967,42.884874
3,Nash County--Braswell Memorial Library,15,-7.1967,42.884874
4,Greensboro Public,8,-7.1967,42.884874


In [None]:
# create a new column 'label' which will be useful for graph marker

# access the row values by column name (positional access x[0], x[1] on a Series is deprecated)
df_filtered['label'] = df_filtered.apply(lambda x: '{} - ({})'.format(x['name'], x['checkoutcount']), axis=1)
df_filtered = df_filtered.sort_values(by=['checkoutcount'], ascending=False).reset_index(drop=True)
df_filtered.head(30)

Unnamed: 0,name,checkoutcount,long,lat,label
0,Cabarrus County Charles A. Cannon Library,35257,-3.734437,40.384181,Cabarrus County Charles A. Cannon Library - (3...
1,Pub Library CHARLOTTE & MEC,23237,-80.842216,35.22767,Pub Library CHARLOTTE & MEC - (23237)
2,Sheppard Memorial Library Pitt County,11762,-77.362316,35.595715,Sheppard Memorial Library Pitt County - (11762)
3,Craven-Pamlico-Carteret Regional,9981,-3.235347,40.2546,Craven-Pamlico-Carteret Regional - (9981)
4,Cumberland County Public Library,8405,-78.89644,35.098176,Cumberland County Public Library - (8405)
5,Alamance County Public Libraries,5905,-7.1967,42.884874,Alamance County Public Libraries - (5905)
6,Randolph County Public,5744,14.140361,50.15685,Randolph County Public - (5744)
7,Forsyth County Public,4296,-80.172191,36.121144,Forsyth County Public - (4296)
8,Fontana Regional,2522,-7.1967,42.884874,Fontana Regional - (2522)
9,Durham County,1849,-7.1967,42.884874,Durham County - (1849)


## Explore the neighbourhood of "Cabarrus County Charles A. Cannon Library" as it has the most checkouts

### Find the nearby venues using the Foursquare API to see why this library is so popular

In [None]:

# Credentials for the Foursquare API
# Read from environment variables so the secret is neither stored nor printed in the notebook.
# The variable names FOURSQUARE_CLIENT_ID / FOURSQUARE_CLIENT_SECRET are a naming assumption.
import os

CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret

VERSION = '20180605' # Foursquare API version
LIMIT = 30


In [None]:
# get the most popular library

df_filtered.loc[0, 'name']

'Cabarrus County Charles A. Cannon Library'

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Library', 
                  'Library Latitude', 
                  'Library Longitude', 
                  'Nearby Venue', 
                  'Nearby Venue Latitude', 
                  'Nearby Venue Longitude', 
                  'Nearby Venue Category']
    
    return(nearby_venues)

In [None]:
popularlibrary_venues = getNearbyVenues(names=df_filtered['name'],latitudes=df_filtered['lat'],longitudes=df_filtered['long'])                                 
print(popularlibrary_venues.shape)
popularlibrary_venues.head()

Cabarrus County Charles A. Cannon Library
Pub Library CHARLOTTE & MEC
Sheppard Memorial Library Pitt County
Craven-Pamlico-Carteret Regional
Cumberland County Public Library
Alamance County Public Libraries
Randolph County Public
Forsyth County Public
Fontana Regional
Durham County
New Hanover County Public Library
Chatham County Public Libraries
Beaufort-Hyde-Martin Regional
GRANVILLE CO LIBRARY SYSTEM
Mooresville Public Library
Hocutt-Ellington Memorial Library
Brunswick County Library
Gaston County Public Library
DAVIDSON COUNTY PUBLIC LIBRARY
Neuse Regional Library
Wilson County Public
Sandhill Regional
High Point Public
NC State University
Alexander County
Buncombe County Public
Henderson County Public
Union County Public
Wake County
East Albemarle Regional
Nantahala Regional
Chapel Hill Public
Kings Mountain/Jacob S. Mauney Memorial
Western Carolina University
Person County Library
Appalachian Regional
Burke County Public Library
Bladen County Public Library
Northwestern Regional
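
To see which kinds of venues surround a library, the rows returned by `getNearbyVenues` can be tallied by category. The sketch below uses only the standard library and a few made-up rows (the library names and categories are illustrative, not results from this project). It assumes each row carries the library name and the venue category, as in the `Library` and `Nearby Venue Category` columns; the helper name `count_categories` is hypothetical.

```python
from collections import Counter, defaultdict

# Made-up (library, venue category) pairs, for illustration only
rows = [
    ("Library A", "Coffee Shop"),
    ("Library A", "Park"),
    ("Library A", "Coffee Shop"),
    ("Library B", "Pizza Place"),
]


def count_categories(pairs):
    """Return {library: Counter of venue categories} for the given pairs."""
    counts = defaultdict(Counter)
    for library, category in pairs:
        counts[library][category] += 1
    return counts


category_counts = count_categories(rows)

# Show the most frequent venue category around each library
for library, counter in category_counts.items():
    print(library, counter.most_common(1))
```

Comparing these tallies between libraries with many and few checkouts is one simple way to look for venue types that go along with popularity.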

# Build a model to identify popular libraries

In [None]:
# let's go back to the original dataset
df.head()

In [None]:
df["name"].describe()

In [None]:
# convert the values of the 'type' column into indicator columns: checkout, hold and wishlist

df_format = pd.get_dummies(df, columns=['type'])
df_format.rename(columns={"type_checkout.post": "checkout","type_hold.post": "hold", "type_wishlist.post": "wishlist"}, inplace=True)
df_format = df_format.filter(items=['name', 'checkout','hold','wishlist'])
df_format.head()

In [None]:
# group by name and sum the total checkout, total holds, total wishlist

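# select several columns with a list (double brackets); tuple-style selection is not supported by current pandas
df_sum = df_format.groupby('name')[['checkout','hold','wishlist']].sum().astype(int).reset_index()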
df_sum.head()

### Define popular libraries based on the number of checkouts

In [None]:
def set_popular(row):
    if row['checkout'] > 500:
        return 1
    else:
        return 0

df_popular = df_sum.assign(popular=df_sum.apply(set_popular, axis=1))
df_popular.head()

In [None]:
# determine the X values (features)

X = np.asarray(df_popular[['checkout', 'hold', 'wishlist']])
X[0:5]

In [None]:
# determine the y values (labels)


y = np.asarray(df_popular['popular'])
y[0:5]

In [None]:
X = preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]
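
`StandardScaler` rescales each feature column to zero mean and unit variance using z = (x - mean) / std, where std is the population standard deviation. A minimal plain-Python sketch of that formula for a single column follows. The helper name `standardize` is hypothetical, and the sample values are the first five `checkoutcount` values from the grouped output shown earlier.

```python
from statistics import mean, pstdev


def standardize(values):
    """Scale values to zero mean and unit variance (population std).

    Assumes the values are not all identical, otherwise the
    standard deviation is zero and the division fails.
    """
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


checkouts = [7, 5905, 42, 635, 293]
scaled = standardize(checkouts)
print(scaled)
```

After scaling, a very large count such as 5905 no longer dominates the other features purely because of its magnitude.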

### Train/Test Dataset

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)
print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)

### Modeling (Logistic Regression with Scikit-learn)

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train,y_train)
LR

In [None]:
yhat = LR.predict(X_test)
yhat

In [None]:
yhat_prob = LR.predict_proba(X_test)
yhat_prob

## Evaluation

### jaccard index

In [None]:
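# jaccard_similarity_score was removed from scikit-learn (0.23); jaccard_score is its replacement.
# Note: for binary labels jaccard_score reports the Jaccard index of the positive class,
# whereas the old function returned plain accuracy.
from sklearn.metrics import jaccard_score
jaccard_score(y_test, yhat)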
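
For a binary classifier, the Jaccard index of the positive class is TP / (TP + FP + FN): the overlap between predicted and actual positives divided by their union. The plain-Python sketch below computes it from the confusion counts. The function names and the demo labels are made up for illustration and are not results from this project.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for the given positive label."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def jaccard_index(y_true, y_pred, positive=1):
    """Jaccard index of the positive class: TP / (TP + FP + FN)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    denominator = tp + fp + fn
    return tp / denominator if denominator else 0.0


# Made-up labels for illustration
y_true_demo = [1, 0, 1, 1, 0, 0]
y_pred_demo = [1, 0, 0, 1, 1, 0]

print(confusion_counts(y_true_demo, y_pred_demo))
print(jaccard_index(y_true_demo, y_pred_demo))
```

Unlike accuracy, this measure ignores true negatives, so it is not inflated when most libraries are correctly predicted as not popular.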

### confusion matrix

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
import itertools
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
print(confusion_matrix(y_test, yhat, labels=[1,0]))

In [None]:
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, yhat, labels=[1,0])
np.set_printoptions(precision=2)


# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['popular=1','popular=0'],normalize= False,  title='Confusion matrix')

In [None]:
print (classification_report(y_test, yhat))

### log loss

In [None]:
from sklearn.metrics import log_loss
log_loss(y_test, yhat_prob)
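
Log loss penalises confident wrong probabilities. For binary labels it is the average of -[y log(p) + (1 - y) log(1 - p)], where p is the predicted probability of the positive class. The plain-Python sketch below implements that formula; the function name, the clipping constant `eps` and the demo values are assumptions made for illustration.

```python
import math


def binary_log_loss(y_true, p_positive, eps=1e-15):
    """Average negative log-likelihood for binary labels (0 or 1).

    p_positive holds the predicted probability of class 1 for each sample.
    Probabilities are clipped to [eps, 1 - eps] to avoid log(0).
    """
    total = 0.0
    for y, p in zip(y_true, p_positive):
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


# Made-up labels and probabilities for illustration
demo_labels = [1, 0, 1]
demo_probs = [0.9, 0.2, 0.6]
loss = binary_log_loss(demo_labels, demo_probs)
print(loss)
```

In the notebook, the positive-class probabilities would be the second column of `yhat_prob`, that is `yhat_prob[:, 1]`. Lower values indicate better calibrated predictions.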

# Result

### Plot the popular libraries on a map of North Carolina

In [None]:

address = 'North Carolina'

# recent geopy versions require a custom user_agent
geolocator = Nominatim(user_agent="geoapiExercises")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of North Carolina are {}, {}.'.format(latitude, longitude))

map_geo = folium.Map(location=[latitude, longitude], zoom_start=8)

# add one marker per library, labelled with its name and checkout count
# (the circle-style options radius/color/fill belong to folium.CircleMarker and are dropped here)
for lat, lng, label in zip(df_filtered['lat'], df_filtered['long'], df_filtered['label']):
    popup = folium.Popup(label, parse_html=True)
    folium.Marker(
        [lat, lng],
        popup=popup,
        icon=folium.Icon(color='green')).add_to(map_geo)

map_geo

# Conclusion

The purpose of this project was to identify the popular libraries in the state of North Carolina. From the dataset we saw that the number of checkouts alone lets us analyse and rank the popular libraries; however, that was not enough for an optimal solution, so we built a logistic regression model to evaluate the popular libraries, and we identified that there are many factors, such as nearby venues, that determine the popularity of a library. This result can be used by different stakeholders when building a new library in the state of North Carolina.