# Bournemouth Venues Analysis
<a>https://www.kaggle.com/r3w0p4/bournemouth-venues</a>


# INTRODUCTION
This kernel presents an analysis of the Kaggle dataset "Bournemouth Venues" (see link above), which lists venues in Bournemouth. Our first task is to load the CSV file into a dataframe and see what we have to work with. Then we will brainstorm some ideas for visualizing this data.

# LOADING DATA INTO A DATAFRAME

Here we import the libraries. The reasoning behind why certain libraries are being used will be explained later. 

In [None]:
#!pip install wordcloud  # install wordcloud - run this once
import pandas as pd 
from urllib.request import Request, urlopen # for opening a connection to a website
from bs4 import BeautifulSoup #web scraping 
import folium #map visualization
import html
import sys # mostly used for debugging purposes
#matplotlib for graphs
import matplotlib.pyplot as plt
from wordcloud import WordCloud
%matplotlib inline 
import seaborn as sns
import re
    

Now we read the CSV file and display some information about the dataframe, as well as the first few rows of the dataset.

In [None]:
df = pd.read_csv('../input/bournemouth_venues.csv')

print("Information about the dataset:\n")
df.info()

print("\nFirst few rows of the dataset:\n")
print(df.head())



As you can see, we have a number of venues and their corresponding categories. We also have a geospatial coordinate for each venue. From this I am thinking about a map, venues which may have reviews, and possibly a count of the most popular categories.

# Analysis Brainstorm

I think it will be interesting to investigate the following:
- A count of the number of unique venues vs "chain" venues
- A word cloud of venue categories showing the popularity of the different types
- A map view of the town showing markers for the different venues, to see how far away they are in relation to the town and each other
- Use BeautifulSoup to grab review data from Google for each unique venue (we won't be interested in chain venues because who wants to eat McDonald's on a holiday?? Go try new unique local things!!) and plot it on a map, perhaps colour coded by review score.



However, before we do any of this, we need to ensure our data is in a suitable format and check whether it needs cleaning.

# Cleaning the data

The data appears, at first glance, to be in a suitable format. Let's check the last rows of the dataframe to see if there are any NaN elements:

In [None]:
df.tail()

It seems okay. However, we may run into formatting issues later, so we will deal with data cleaning there and then. At the moment there's not a lot more I can do apart from renaming the column headers:

In [None]:
df = df.rename(columns={
'Venue Name': 'Name',
'Venue Category': 'Category',
'Venue Latitude': 'Latitude',
'Venue Longitude': 'Longitude'})
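
Looking at the last rows only shows whether those particular rows contain NaN. A more complete check counts the missing values in every column. The sketch below uses a small made-up dataframe (the venue rows are hypothetical, not taken from the dataset) so that it runs on its own; on the real data the same method would be called on `df`.

```python
import pandas as pd

# Hypothetical sample rows standing in for the real dataset
sample = pd.DataFrame({
    "Name": ["Cafe A", "Hotel B", None],
    "Category": ["Cafe", "Hotel", "Pub"],
    "Latitude": [50.72, None, 50.71],
    "Longitude": [-1.88, -1.87, -1.90],
})

# Count missing values per column; all zeros would mean nothing to clean
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

If every count is zero, no rows need to be dropped or filled before the analysis.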

# Analysis: Count unique venues vs "chain" venues
The dataset contains a mixture of venue types, from hotels to cafes. There is a possibility of finding multiple venues that are part of a chain, like Starbucks or McDonald's. Let's look into this and see how many unique venues we have.

In [None]:
# In Excel you could use a pivot table to summarise the "Venue Name" column and find the number of occurrences of each venue name.
#Let's use something similar in pandas:

#create a pivot table to count the number of occurrences of each value in the "Name" column
df_pivot = pd.pivot_table(df, index=['Name'], aggfunc='count')

#filter our pivot table to only show venues that occur more than once
df_pivot_filter = df_pivot[df_pivot.Category > 1]


Total_venues = len(df.Name) #total number of venues in our data
chain_venues = len(df_pivot_filter.Category) #of those venues how many of them are part of a chain?

print(f"In total we have {Total_venues} venues. Of these venues, {chain_venues} of them are part of a chain")
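
As a cross-check, the same numbers can be reached without a pivot table by counting names with `value_counts`. This is only a sketch on hypothetical venue names, and the variable names are my own.

```python
import pandas as pd

# Hypothetical venue names; "Coffee Chain" appears twice, so it counts as a chain
names = pd.Series(["Coffee Chain", "Local Pub", "Coffee Chain", "Beach Hut"])

name_counts = names.value_counts()           # occurrences of each venue name
chain_names = name_counts[name_counts > 1]   # names that occur more than once

total_venues = len(names)
chain_brands = len(chain_names)              # number of distinct chain names
chain_rows = int(chain_names.sum())          # number of rows that belong to a chain

print(f"{total_venues} venues, {chain_brands} chain name(s) covering {chain_rows} rows")
```

Note the distinction between the number of distinct chain names and the number of rows that belong to a chain; the pivot-table version above reports the former.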

# Analysis: Most popular venue categories
Now we will have a look into the venue categories and gain insight into the most popular venue types. We will do this by plotting a word cloud as well as a bar chart of the most popular types.

In [None]:
#First generate a wordcloud of all the categories

#Let's grab the Category column and put it into a list.
cata_list = df["Category"].values.tolist()

#Go through the list and remove all leading and trailing white space.
cata_list = [ele.strip(" ") for ele in cata_list]


#Go through the list and join multi-word categories into single words
cata_list = [ele.replace(" ","") for ele in cata_list]

#finally convert the list to one long string, separated by spaces
cata_list = ' '.join(cata_list)


#******The plot*******

#create a new figure instance
fig = plt.figure(figsize=(7,7))

#create a new axes for this plot. it's a single graph, so one axes filling the figure is enough
ax = fig.add_subplot(111)

#Create the wordcloud object
wordcloudobj = WordCloud(width=500, height=500,
               margin=0, max_font_size=100, min_font_size=12,
               background_color="lightyellow", collocations = False).generate(cata_list)

#plot details
ax.set_title("WordCloud of different category types")
ax.axis("off") #turn off axis

# show plot
ax.imshow(wordcloudobj, interpolation='bilinear') # Display the generated image:
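
A word cloud sizes each word by how often it occurs, so the same information can be read off numerically. The sketch below applies the cleaning steps from the cell above to a few hypothetical category strings and counts the resulting tokens with `collections.Counter`.

```python
from collections import Counter

# Hypothetical category values, including stray whitespace
categories = ["Coffee Shop", " Hotel", "Coffee Shop ", "Pub", "Hotel", "Coffee Shop"]

# Same cleaning as above: strip whitespace, then join multi-word categories
tokens = [c.strip(" ").replace(" ", "") for c in categories]

# Frequency of each token, most common first
token_counts = Counter(tokens)
print(token_counts.most_common(2))
```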


As you can see, there are a number of categories with a similar level of popularity. We can look into this further by plotting a bar chart showing the count for each category type:

In [None]:
#plotting a bar graph for the number of occurrences of each category type

#create a pivot table to count the number of occurrences in the "Category" column
df_catapivot = pd.pivot_table(df, index=['Category'], aggfunc='count')
print(df_catapivot.head())

Our df_catapivot dataframe was formed from a pivot table operation. We can see that the index is the category and the other columns hold the number of occurrences. Let's clean this dataframe to make it clearer what is going on.

In [None]:
#we have df_catapivot as seen above; we will clean this to prepare for the bar plot:

#drop the latitude and longitude columns:
del df_catapivot["Latitude"]
del df_catapivot["Longitude"]

#rename the "Name" column to "Count", as that is what it is showing
df_catapivot = df_catapivot.rename(columns={'Name': 'Count'})

#set an integer index and keep the former index as a new column
df_catapivot = df_catapivot.reset_index(drop=False)

print(df_catapivot.head())

In [None]:
# And now for the bar plot

# Close previous figure:
plt.close(fig)

#create a new figure instance
fig = plt.figure(figsize=(7,7))

#create a new axes for this plot.
ax = fig.add_subplot(111)

#plot the bar plot using seaborn (keyword arguments, as recent seaborn versions expect)
sns.barplot(x=df_catapivot["Category"], y=df_catapivot["Count"], ax=ax)

Insight gained. We see a large number of categories on the x-axis; however, most of them have count = 1, and we are, after all, interested in the more popular ones.
Let's plot this again, but filter the dataframe to remove all categories with count = 1.

In [None]:
#filter the dataframe to remove all X values with count = 1
df_catapivot_filt = df_catapivot[df_catapivot.Count > 1]


# And finally plotting the bar plot with filtered data

# Close previous figure:
plt.close(fig)

#create a new figure instance
fig = plt.figure(figsize=(7,7))

#create a new axes for this plot.
ax = fig.add_subplot(111)

#plot the bar plot using seaborn (keyword arguments, as recent seaborn versions expect)
sns.barplot(x=df_catapivot_filt["Category"], y=df_catapivot_filt["Count"], ax=ax)

#details

#set title
ax.set_title("Count of the different venue types in Bournemouth")
#rotate x-axis labels so the category names do not overlap
ax.tick_params(axis='x', labelrotation=90)
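
The bars are easier to compare when they are sorted by count. A minimal sketch, using a hypothetical table with the same `Category` and `Count` columns as `df_catapivot_filt`:

```python
import pandas as pd

# Hypothetical counts in the same shape as df_catapivot_filt
counts = pd.DataFrame({
    "Category": ["Pub", "Hotel", "Coffee Shop"],
    "Count": [4, 9, 6],
})

# Sort so the most common category comes first
counts_sorted = counts.sort_values("Count", ascending=False).reset_index(drop=True)
print(counts_sorted)
```

Passing the sorted frame to the bar plot should then draw the bars from most to least common.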

# Analysis: Map Visualization of venue locations

Now I want to use folium to plot a map visualization of the different venues. We will gain some insight into where different venues are located relative to each other and within the town in general.


In [None]:

    
#First we need to center our map on Bournemouth. A simple google search reveals:
#"The latitude of Bournemouth, UK is 50.720806, and the longitude is -1.904755."

Bournemouth_coord = (50.720806, -1.904755)

# create empty map zoomed in on Bournemouth
bournemap = folium.Map(location=Bournemouth_coord, zoom_start=12)

#add one circle marker per venue, labelled with its name and category
for _, row in df.iterrows():
    text = f"{row['Name']} ({row['Category']})"
    folium.CircleMarker(
        [row["Latitude"], row["Longitude"]],
        radius=8,
        popup=html.escape(text),
        color='blue',
        fill_color='green',
        fill=True,
        fill_opacity=0.7
        ).add_to(bournemap)


#display our map
print("Map of Bournemouth with venue locations")
display(bournemap)







Just from looking at the geospatial plot, with no prior knowledge of Bournemouth, you can see where the town centre is from the concentrated positioning of these venues. It's interesting to note that these points are all focused around the town centre, which itself is within the bounds of the green parks of Bournemouth. This, coupled with the scattered venues along the beach, further indicates that this town is heavily focused on tourism. It would be interesting to compare this with busy cities like central London or Manchester, as I would expect venues there to sit in rows along the close, compact streets rather than exploiting the natural beauty as Bournemouth does.
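
To put a number on how far a venue is from the centre, the great-circle distance between two coordinates can be computed with the haversine formula. The function name and the example venue coordinate below are assumptions for illustration; the centre coordinate is the one used for the map above.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

centre = (50.720806, -1.904755)   # map centre used above
venue = (50.716, -1.876)          # hypothetical venue coordinate

distance = haversine_km(*centre, *venue)
print(f"{distance:.2f} km from the centre")
```

Applied to every row, this would give a distance column that could be summarised or plotted alongside the map.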

# Review data integration and analysis using Beautiful Soup

*****************************************************
NOTE: the code below may raise errors.
REASON: Kaggle Kernels do not have internet access turned on by default. You may have to fork the kernel and run it with the internet option selected in the settings pane. Otherwise, run the code locally.
****************************************************

Now I want to loop through the data and, for each unique (non-chain) venue, use BeautifulSoup to effectively:
    - search for the venue name on Google
    - look for Google's sidebar, which displays information about a location including any review
    - extract this review data (should be a value from 1-5)
    - save it as a new column in our dataframe
    - plot a folium map of the town with the venues colour coded with respect to their rating.
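
The first step, turning a venue name into a search URL, can be sketched with the standard library. `urllib.parse.quote_plus` percent-encodes characters that are not URL-safe (such as the curly apostrophe in "Lola’s") and replaces spaces with `+`. The helper name `build_search_url` and the bare `?q=` query string are assumptions for illustration.

```python
from urllib.parse import quote_plus

SEARCH_BASE = "https://www.google.com/search?q="  # simplified base URL (assumption)

def build_search_url(venue_name):
    """Return a search URL for 'UK Bournemouth <venue_name>' with the query safely encoded."""
    query = "UK Bournemouth " + venue_name
    return SEARCH_BASE + quote_plus(query)

print(build_search_url("Lola’s"))
```

Encoding the whole query this way avoids having to special-case individual characters in venue names.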
    
REMEMBER! We are not interested in chain venues, so we won't consider them in this part.
We have a dataframe, filtered using a pivot table, which shows the venues that occur more than once.
Let's use this pivot table to drop those rows from our data,
then use the new dataframe to loop through and grab review data.

Let's have a look at this pivot table:
    

In [None]:
print("The pivot table - showing venues that are part of a chain of venues")
print(df_pivot_filter)

This is a dataframe. Let's reset the index while keeping the current index as a column, since it holds the list of chain venues.

In [None]:
print("Resetting the Index whilst keeping the original index but as a new column")
df_pivot_filter = df_pivot_filter.reset_index(drop=False)
print(df_pivot_filter)

In [None]:
#Now make a copy of our df (an explicit copy, so that df itself is left unchanged):
dfbs = df.copy()

#Add a new column "Review"
dfbs["Review"] = ""

#Drop all rows whose venue name appears in "df_pivot_filter" (these are the chain venues)
dfbs = dfbs[~dfbs["Name"].isin(df_pivot_filter["Name"])]

#we have removed rows, so we need to reset the index
dfbs = dfbs.reset_index(drop=True)

#upon testing the code it was noticed that certain strings (Lola’s) threw errors,
#so we replace all instances of "’" and URL-encode the search query
from urllib.parse import quote_plus

for i in range(len(dfbs["Name"])):
    
    venue = dfbs.loc[i,"Name"]
    
    #if the string venue contains "’", replace it with "'"
    venue = venue.replace("’","'")

    #we will effectively do a google search, so we construct the query here:
    #we search for: "UK Bournemouth venueName" - this should give us a side box with venue details in google
    location = "UK Bournemouth " + venue
    
    #constructing the weblink; quote_plus encodes special characters and turns spaces into "+"
    web = "https://www.google.com/search?safe=active&source=hp&ei=bo4vXfiVIY-yUtPGomg&q="
    web_full = web + quote_plus(location)
    
    #sending a request using urllib with added headers to ensure we get a connection
    #the header just tells the webpage that we are accessing it using Mozilla version 5.0
    req = Request(web_full, headers={'User-Agent': 'Mozilla/5.0'})
    
    #open the webpage and read
    page = urlopen(req).read()
    
    #using beautiful soup, parse the html of the page into a soup object
    soup = BeautifulSoup(page, 'html.parser')
    
    #soup.find returns None when no review element is present;
    #accessing .text on None raises an AttributeError, which we catch and store 0.0 instead
    try:
        data1 = soup.find('span', attrs={'class': 'oqSTJd'})
        score = data1.text.strip() # strip() removes leading and trailing whitespace
        
        dfbs.loc[i,"Review"] = score  #store the score into dfbs
        
    except AttributeError:
        dfbs.loc[i,"Review"] = 0.0 # since no score was available, set score to 0

    
#show the first few rows of our dataframe
print(dfbs.head())









This worked well; however, we now have a "Review" column containing 0 values. These are venues with no reviews, or venues for which we were unable to get a review.
Furthermore, we have data points ("4.0") that are strings, and finally some points which look like this: 8.3/10.

This requires some data cleaning, so that we end up with a column of datatype float.

In [None]:
#we know we have some points that have this sort of shape: 8.3/10
#we take the 8.3, divide by 10 and multiply by 5 (i.e. divide by 2) to get a score out of 5, then store it as a float
#we can use regex to extract the number in front of the /10
pullrule = r'([0-9]+\.[0-9]+)' # REGEX RULE: example: 4.8
checkrule = r'([0-9]+\.[0-9]+/)' # REGEX RULE: example: 4.8/

#loop through the Review data
for i in range(len(dfbs["Review"])):
    
    data = str(dfbs.loc[i,"Review"]) #convert data into a string so we can use REGEX on it
    is_check = re.match(checkrule, data) #match object if data starts with e.g. "8.3/", otherwise None
    
    if is_check:
        #use pullrule to extract the number in front of the "/", e.g. 8.3/10 -> 8.3
        score_fmt = re.findall(pullrule, data)
        dfbs.loc[i,"Review"] = float(score_fmt[0]) / 2 #rescale from /10 to /5 and store as float
    else:
        dfbs.loc[i,"Review"] = float(data) #plain score such as "4.0": convert to float

#make sure the whole column has datatype float
dfbs["Review"] = dfbs["Review"].astype(float)
    



Now we have a dataframe including review data from google, fully formatted and cleaned:

In [None]:
dfbs.head()
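
The cleaning rule above can also be written as a small standalone function, which makes it easy to test on individual values. This is a sketch rather than the code used in this kernel: the function name is my own, and it additionally accepts whole-number scores such as "9/10" and scales other than 10.

```python
import re

# Matches scores such as "8.3/10" or "9/10": a number, a slash, then the scale
FRACTION_RE = re.compile(r"^\s*([0-9]+(?:\.[0-9]+)?)\s*/\s*([0-9]+(?:\.[0-9]+)?)\s*$")

def normalise_rating(raw, target_scale=5.0):
    """Convert a raw review value ("4.0", "8.3/10", 0.0) to a float out of target_scale."""
    text = str(raw)
    match = FRACTION_RE.match(text)
    if match:
        score, scale = float(match.group(1)), float(match.group(2))
        return score / scale * target_scale
    return float(text)  # plain numbers are assumed to be out of target_scale already

print(normalise_rating("8.3/10"))
```

With a function like this, the whole column could be converted in one step using `dfbs["Review"].apply(normalise_rating)`.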

At this point the data analysis has gone beyond the scope of the original data. 
It would be interesting to see how the review data differs from region to region in the town as well as the spread of reviews across the town in general. 

One way this could be done is by colour coding: using a map() function (which I've come across in p5.js) to map "review" to an RGB colour. As such:

def mapping(n, start1, stop1, start2, stop2):
    return ((n-start1)/(stop1-start1))*(stop2-start2)+start2
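
As a usage sketch, the function can turn a rating between 0 and 5 into a colour running from blue (low) to red (high). The helper `rating_to_hex` is hypothetical, and `mapping` is repeated so that the example runs on its own.

```python
def mapping(n, start1, stop1, start2, stop2):
    # linearly rescale n from the range [start1, stop1] to [start2, stop2]
    return ((n - start1) / (stop1 - start1)) * (stop2 - start2) + start2

def rating_to_hex(rating, lowest=0.0, highest=5.0):
    """Map a rating to a hex colour: blue for the lowest rating, red for the highest."""
    rating = min(max(rating, lowest), highest)              # clamp to the valid range
    red = round(mapping(rating, lowest, highest, 0, 255))   # more red as the rating rises
    blue = 255 - red                                        # less blue as the rating rises
    return f"#{red:02x}00{blue:02x}"

for r in (0.0, 2.5, 5.0):
    print(r, rating_to_hex(r))
```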
    
    

So effectively I will have bluer colours representing venues whose category falls within the less common sort, and redder colours representing venues whose category falls within the more common sort (as revealed by our bar chart above).

Upon further reading, folium only accepts a fixed set of named colours for marker icons (`folium.Icon`), so discrete colours are the simpler option. In this case we could create a dictionary:

d = {"black":0,
     "darkpurple":1,
     "blue":2,
     "green":3,
     "orange":4,
     "red":5}

This way we could assign a colour based on the rating, then finally plot the map in folium.
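
A minimal sketch of that lookup: the rating is rounded to the nearest whole number and used as a key into a rating-to-colour dictionary (the inverse of the dictionary above, since we look up by rating). The function name and the rounding rule are assumptions.

```python
# Rating (0-5) to colour name, using the colours listed above
RATING_COLOURS = {
    0: "black",
    1: "darkpurple",
    2: "blue",
    3: "green",
    4: "orange",
    5: "red",
}

def colour_for_rating(rating):
    """Return the colour name for a rating, rounding to the nearest whole number."""
    key = int(rating + 0.5)       # round half up (avoids Python's round-half-to-even)
    key = min(max(key, 0), 5)     # clamp to the 0-5 range
    return RATING_COLOURS[key]

print(colour_for_rating(4.15))
```

The returned name could then be used as the `color` of a `folium.Icon` when adding markers to the map.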