# Segmenting and Clustering Airbnb Listings in Berlin, Germany


## Data Science Project

Mohamed Abdul Fatah



# 1 Introduction/Business Problem

### Airbnb has disrupted the traditional hospitality industry as more and more travelers choose it as their primary accommodation provider. Since its founding in 2008, Airbnb has grown enormously, with the number of rentals listed on its website growing exponentially each year. In Germany, no city is more popular than Berlin, which makes it one of the hottest Airbnb markets in Europe, with 22,552 listings as of November 2018.

#### Although Airbnb listings provide enough information about the shared space, they say little about the surrounding area. For example, travelers might be interested in what kinds of venues are close to the accommodation they book. In addition, travelers cannot filter Airbnb listings by nearby venues. In other words, each time travelers search for accommodation on Airbnb, they may want direct information about the venues in the area and a list of similar listings with the same venue categories nearby.

### The main objective of this project is to explore, segment and cluster Airbnb listings in Berlin, Germany. I will use the Foursquare API to explore the areas around the Airbnb listings in Berlin. I will use the explore function to get the most common venue categories for each Airbnb listing, and then use this feature together with the prices to group the listings into clusters. I will use the k-means clustering algorithm to complete this task. Finally, I will use the Folium library to visualize the listings in Berlin and their emerging clusters.

# Data

## * Foursquare API: to get the most common venues around a given Airbnb listing.

## * Airbnb Data Collection: the data provided for each Airbnb listing. Each link downloads a zip file of the data for a named city or region; in my case this is Berlin, Germany. The zip file holds one or more csv files. Each csv file represents a single "survey" or "scrape" of the Airbnb web site for that city. The data is collected from the public Airbnb web site. Each csv file contains the following attributes:

### * room_id: A unique number identifying an Airbnb listing.

### * host_id: A unique number identifying an Airbnb host.

### * room_type: One of "Entire home/apt", "Private room", or "Shared room".

### * borough: A sub-region of the city or search area for which the survey is carried out. The borough is taken from a shapefile of the city that is obtained independently of the Airbnb web site. For some cities, there is no borough information; for others the borough may be a number.

### * neighborhood: As with borough: a sub-region of the city or search area for which the survey is carried out. For cities that have both, a neighborhood is smaller than a borough. For some cities there is no neighborhood information.

### * reviews: The number of reviews that a listing has received. Airbnb has said that 70% of visits end up with a review, so the number of reviews can be used to estimate the number of visits. Note that such an estimate will not be reliable for an individual listing (especially as reviews occasionally vanish from the site), but over a city as a whole it should be a useful metric of traffic.

### * overall_satisfaction: The average rating (out of five) that the listing has received from those visitors who left a review.

### * accommodates: The number of guests a listing can accommodate.

### * bedrooms: The number of bedrooms a listing offers.

### * price: The price for a night stay. In early surveys, there may be some values that were recorded by month.

### * minstay: The minimum stay for a visit, as posted by the host.

### * latitude and longitude: The latitude and longitude of the listing as posted on the Airbnb site: this may be off by a few hundred meters.

### * last_modified: the date and time that the values were read from the Airbnb web site
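
The remark in the `reviews` entry, that review counts can stand in for visits, can be sketched as a small calculation. This is only an illustration under the assumption that the 70% review rate quoted above holds on average; the function name `estimate_visits` and the sample review counts are hypothetical and not part of the dataset.

```python
REVIEW_RATE = 0.70  # share of visits that end with a review, as quoted above


def estimate_visits(review_counts, review_rate=REVIEW_RATE):
    """Estimate the total number of visits for a group of listings from their review counts."""
    if not 0 < review_rate <= 1:
        raise ValueError("review_rate must be in (0, 1]")
    return sum(review_counts) / review_rate


# Hypothetical review counts for five listings (illustrative values only)
sample_reviews = [14, 0, 35, 7, 21]
city_estimate = estimate_visits(sample_reviews)
print(round(city_estimate))
```

As the attribute description notes, such an estimate is more meaningful when summed over many listings than for any single one.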

## Airbnb data is used to get the coordinates (latitude and longitude), neighbourhood and price for each listing in Berlin, Germany. Having this information, I can leverage the Foursquare API to explore the areas around the Airbnb listings and get the most common venue categories for each listing. Venue categories together with the price are used to segment the listings into similar clusters.


## If you don't want the details, just scroll to the end of the project, where you'll see a map with the average price range per night for each neighborhood and the clusters of neighborhoods with their most common venues.



# Methodology


# 1- Airbnb Data Wrangling: Clean and Transform

In [None]:
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation

!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images
from IPython.display import Image, HTML
    
# transforming a JSON file into a pandas dataframe
from pandas import json_normalize

!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library

print('Folium installed')
print('Libraries imported.')

In [None]:
import time
# show every column and row when displaying dataframes (pandas was imported in the previous cell)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files
import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe

import folium # map rendering library

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

import seaborn as sns

# import k-means for the clustering stage
from sklearn.cluster import KMeans

print('Libraries imported.')

# Get the Airbnb data of listings in Berlin  

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
# load data
df_s = pd.read_csv('/kaggle/input/berlin-airbnb-data/listings_summary.csv')


In [None]:
df_s.head()

In [None]:
df_s.shape

In [None]:
df_s.columns

## Let's keep the important features

In [None]:
columns_to_keep = ['id','host_has_profile_pic','host_since',
                   'latitude', 'longitude','property_type', 'room_type', 'accommodates', 'bathrooms',  
                   'bedrooms', 'bed_type', 'amenities', 'price', 'cleaning_fee',
                   'security_deposit', 'minimum_nights',  
                   'instant_bookable', 'cancellation_policy','availability_365','neighbourhood_cleansed','neighbourhood_group_cleansed']
df_s= df_s[columns_to_keep].set_index('id')

In [None]:
df_s.shape

## Check missing data

In [None]:
total = df_s.isnull().sum().sort_values(ascending = False)
percent = (df_s.isnull().sum()/df_s.isnull().count()*100).sort_values(ascending = False)
missing_df_s  = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_df_s.head(3)

## let's convert string true and false to numeric

In [None]:
# Convert 'f'/'t' to 0/1
df_s['instant_bookable'] = df_s['instant_bookable'].map({'f':0,'t':1})
# Fill missing host_has_profile_pic values with 'f' so the mapping below covers every row
df_s['host_has_profile_pic'] = df_s['host_has_profile_pic'].fillna('f')
# Convert 'f'/'t' to 0/1
df_s['host_has_profile_pic'] = df_s['host_has_profile_pic'].map({'f':0,'t':1})


## let's Remove $ from price

In [None]:
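# Remove "$" and "," from the price and fee columns and convert to float
# (regex=False makes "$" a literal character rather than a regex end-of-string anchor)
df_s['price'] = df_s['price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)
df_s['cleaning_fee'] = df_s['cleaning_fee'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)
df_s['security_deposit'] = df_s['security_deposit'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)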
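
To make the string cleaning above concrete, here is a minimal sketch on a few hand-written values. The sample strings are hypothetical; they only mimic the "$1,234.00" format assumed for the price columns.

```python
import pandas as pd

# Hypothetical price strings in the same "$1,234.00" format as the listings file
raw = pd.Series(["$60.00", "$1,250.00", None, "$0.00"])

# regex=False treats "$" and "," as literal characters, not regex metacharacters
clean = (
    raw.str.replace("$", "", regex=False)
       .str.replace(",", "", regex=False)
       .astype(float)
)
print(clean.tolist())
```

Missing entries stay missing after the conversion, which is why the fee columns still need the median fill that follows.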


## let's fill nan values with median

In [None]:
# Replace missing cleaning_fee values with the median
df_s['cleaning_fee'] = df_s['cleaning_fee'].fillna(df_s['cleaning_fee'].median())
# Replace missing security_deposit values with the median
df_s['security_deposit'] = df_s['security_deposit'].fillna(df_s['security_deposit'].median())
# Assume one bathroom and one bedroom where the value is missing
df_s['bathrooms'] = df_s['bathrooms'].fillna(1)
df_s['bedrooms'] = df_s['bedrooms'].fillna(1)

## Let's check the missing data again

In [None]:
total = df_s.isnull().sum().sort_values(ascending = False)
percent = (df_s.isnull().sum()/df_s.isnull().count()*100).sort_values(ascending = False)
missing_df  = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_df.head(3)

## Let's remove any outliers

In [None]:
#Check distribution of price column
df_s.describe()


In [None]:
df_s['price'].describe()


# * The 75th percentile of the price is about $70, so I will drop prices above $180 and also drop the implausible minimum prices of 0 and 1

# * The minimum price is 0

# * The maximum price is $9000



In [None]:
df_s.drop(df_s[ (df_s.price > 180) | (df_s.price == 0) | (df_s.price == 1) ].index, axis=0, inplace=True)

df_s['price'].describe()
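
The same filter can also be written as a boolean mask that returns a new dataframe instead of dropping rows in place. The sketch below uses a small hypothetical price column; only the cut-offs (above 180, exactly 0, exactly 1) come from the cell above.

```python
import pandas as pd

# Hypothetical nightly prices, including the implausible extremes
prices = pd.DataFrame({"price": [0, 1, 25, 45, 70, 180, 181, 9000]})

# Keep rows priced at most 180 that are not exactly 0 or 1
mask = (prices["price"] <= 180) & ~prices["price"].isin([0, 1])
filtered = prices[mask]
print(filtered["price"].tolist())
```

Keeping the mask in a variable makes it easy to count how many rows a cut-off removes before committing to it.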

# Explore and Visualize Airbnb Berlin Data

## Let's have a look at the Airbnb rent price statistics

In [None]:
# boxplot of price column
red_square = dict(markerfacecolor='r', markeredgecolor='r', marker='.')
df_s['price'].plot(kind='box', xlim=(0, 180), vert=False, flierprops=red_square, figsize=(15,3));


In [None]:
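import seaborn as sns
# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay gives the same kind of plot
sns.histplot(df_s['price'], bins=15, kde=True, stat='density')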

## The Airbnb rent price statistics show that prices mostly range between $25 and $75

## Price range distribution among the room types

In [None]:
sns.boxplot(x='room_type',y='price',data = df_s)
plt.show()

## Price range distribution among the bedrooms

In [None]:
sns.boxplot(x='bedrooms', y= 'price', data=df_s)

## We notice here that the more bedrooms a listing has, the higher its price

## Let's have a look at the number of listings in each neighbourhood group

In [None]:
nh = df_s['neighbourhood_group_cleansed'].value_counts().reset_index()
nh.columns = ['neighbourhood_group_cleansed' ,'Count']
nh['Percent'] = nh['Count']/nh['Count'].sum() * 100
nh.head()

## The Friedrichshain-Kreuzberg neighbourhood group has the most listings in the dataset, with 24%

## Let's have a look at the first five most popular neighborhoods among the listings

In [None]:
sns.set(rc={'figure.figsize':(11.7,8.27)})
ax = sns.countplot(x="neighbourhood_group_cleansed", hue="neighbourhood_group_cleansed", data=df_s,
              order=df_s['neighbourhood_group_cleansed'].value_counts().iloc[:5].index)
plt.title('Popular Neighborhoods')
plt.ylabel('Count')
plt.xlabel('neighbourhood_group_cleansed')
plt.show()


## We notice here that the five most popular neighborhoods are, in order from first to fifth:
1. Friedrichshain-Kreuzberg

2. Mitte

3. Neukölln

4. Pankow

5. Charlottenburg-Wilm.

## Let's have a look at the most common room types among the listings

In [None]:
ax = sns.countplot(x="room_type", data=df_s)
plt.title('Room Type distribution')
plt.xlabel('Room Type')
plt.ylabel('Frequency')
plt.show()

## Let's have a look at the room type distribution in the neighborhood groups

In [None]:
plt.figure(figsize=(10,10))
ax = sns.countplot(x="room_type", data=df_s,hue="neighbourhood_group_cleansed")


## Let's break up the amenities, which will help in drawing a better correlation to price, as amenities might impact price


In [None]:
# One-hot encode the nominal categorical columns (bed_type, property_type, cancellation_policy)
for col in ["bed_type","property_type","cancellation_policy"]:
    dummies = pd.get_dummies(df_s[col], prefix = col)
    # replace the original column with its dummy columns
    df_s = pd.concat([df_s.drop(columns=col), dummies], axis=1)

df_s.head(3)

In [None]:
df_s['Laptop_friendly_workspace'] = df_s['amenities'].str.contains('Laptop friendly workspace')
df_s['TV'] = df_s['amenities'].str.contains('TV')
df_s['Family_kid_friendly'] = df_s['amenities'].str.contains('Family/kid friendly')
df_s['Host_greets_you'] = df_s['amenities'].str.contains('Host greets you')
df_s['Smoking_allowed'] = df_s['amenities'].str.contains('Smoking allowed')
df_s['Hot_water'] = df_s['amenities'].str.contains('Hot water')
df_s['Fridge'] = df_s['amenities'].str.contains('Refrigerator')

In [None]:
df_s['No_of_amentities'] = df_s['amenities'].apply(lambda x:len(x.split(',')))
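
The amenity flags and the amenity count above both work on the raw string. As a rough sketch of what that string holds, the helper below splits it into a clean list. It assumes the amenities are stored as a brace-wrapped, comma-separated string with optional double quotes and no commas inside an amenity name; the function name `parse_amenities` and the sample string are hypothetical.

```python
def parse_amenities(raw):
    """Split a brace-wrapped, comma-separated amenities string into clean names."""
    inner = raw.strip().strip("{}")
    if not inner:
        return []
    return [item.strip().strip('"') for item in inner.split(",")]


# Hypothetical amenities string in the assumed format
sample = '{TV,"Cable TV",Wifi,"Laptop friendly workspace"}'
amenities = parse_amenities(sample)

has_tv = "TV" in amenities      # exact match on the amenity name
n_amenities = len(amenities)    # an empty "{}" string counts as 0 here
print(amenities, has_tv, n_amenities)
```

Two differences from the cells above are worth keeping in mind: `str.contains('TV')` also matches "Cable TV", and `len(x.split(','))` counts an empty amenities string as 1.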

## dropping amenities as we have inferred above as different categories

In [None]:
# dropping amenities as we have inferred above as different categories
dropped = ['amenities']
df_s.drop(dropped,axis=1,inplace=True)

## Convert false,true to 0 or 1

In [None]:
#Convert false,true to 0 or 1
df_s['Laptop_friendly_workspace'] = df_s['Laptop_friendly_workspace'].astype(int)
df_s['TV'] = df_s['TV'].astype(int)
df_s['Family_kid_friendly'] = df_s['Family_kid_friendly'].astype(int)
df_s['Host_greets_you'] = df_s['Host_greets_you'].astype(int)
df_s['Smoking_allowed'] = df_s['Smoking_allowed'].astype(int)
df_s['Hot_water'] = df_s['Hot_water'].astype(int)
df_s['Fridge'] = df_s['Fridge'].astype(int)

## Let's also calculate the distances from the city center, airport and railway station, which will again help in drawing a correlation to price

### from city center

In [None]:
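from math import sin, cos, sqrt, atan2, radians  # used by the haversine helpers in this section

# Calculate the haversine distance (in km) from central Berlin
def haversine_distance_central(row):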
    berlin_lat,berlin_long = radians(52.5200), radians(13.4050)
    R = 6373.0
    long = radians(row['longitude'])
    lat = radians(row['latitude'])
    
    dlon = long - berlin_long
    dlat = lat - berlin_lat
    a = sin(dlat / 2)**2 + cos(lat) * cos(berlin_lat) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    return R * c

### from airport

In [None]:
#Calculate distance from airport
def haversine_distance_airport(row):
    berlin_lat,berlin_long = radians(52.3733), radians(13.5064)
    R = 6373.0
    long = radians(row['longitude'])
    lat = radians(row['latitude'])
    
    dlon = long - berlin_long
    dlat = lat - berlin_lat
    a = sin(dlat / 2)**2 + cos(lat) * cos(berlin_lat) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    return R * c

### from Berlin railway station

In [None]:
#Calculate distance from berlin railway station
def haversine_distance_rail(row):
    berlin_lat,berlin_long = radians(52.5073), radians(13.3324)
    R = 6373.0
    long = radians(row['longitude'])
    lat = radians(row['latitude'])
    
    dlon = long - berlin_long
    dlat = lat - berlin_lat
    a = sin(dlat / 2)**2 + cos(lat) * cos(berlin_lat) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    return R * c

In [None]:
from math import sin, cos, sqrt, atan2, radians

In [None]:
df_s['distance_central'] = df_s.apply(haversine_distance_central,axis=1)
df_s['distance_airport'] = df_s.apply(haversine_distance_airport,axis=1)
df_s['distance_railways'] = df_s.apply(haversine_distance_rail,axis=1)
df_s['distance_avg'] = ( df_s['distance_central'] + df_s['distance_airport'] + df_s['distance_railways'] )/3.0
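
The three helpers above differ only in their reference point, so the formula can be sketched once with the reference passed in as an argument. This is an illustrative alternative, not the code used for the results in this notebook; the names `haversine_km` and `REFERENCE_POINTS` and the sample listing coordinates are hypothetical, while the reference coordinates and the 6373 km radius are the ones used above.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6373.0  # same approximate radius as in the cells above


def haversine_km(lat1, lon1, lat2, lon2, radius_km=EARTH_RADIUS_KM):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * radius_km * asin(min(1.0, sqrt(a)))


# Reference coordinates taken from the three functions above
REFERENCE_POINTS = {
    "central": (52.5200, 13.4050),
    "airport": (52.3733, 13.5064),
    "railways": (52.5073, 13.3324),
}

listing = (52.5340, 13.4020)  # hypothetical listing location
distances = {
    name: haversine_km(listing[0], listing[1], ref[0], ref[1])
    for name, ref in REFERENCE_POINTS.items()
}
print({name: round(d, 2) for name, d in distances.items()})
```

With a dataframe, the same function could be applied row by row, for example `df_s.apply(lambda r: haversine_km(r['latitude'], r['longitude'], 52.5200, 13.4050), axis=1)`.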


## Now we are ready to see which factors the price depends on. First I sort by price in descending order, so that the top 1000 properties can be selected, and then generate a correlation matrix

In [None]:
df_s.sort_values(by='price',ascending=False,axis=0,inplace=True) #sorting frame by price desc

In [None]:
df_top10000 = df_s.head(10000)
df_top1000 = df_s.head(1000)

In [None]:
import seaborn as sns 
import matplotlib.pyplot as plt

In [None]:
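sns.set(style="white")
# correlation matrix over the numeric and boolean columns of all remaining listings
corr = df_s.select_dtypes(include=['number', 'bool']).corr()

# generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True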

# set up the matplotlib figure
fig, ax = plt.subplots(figsize=(25, 15))

# generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink":.5},cbar=True);

## Price seems to depend largely on the following factors

## * Number of amenities

## * Whether the listing is family or kid friendly

## * Cleaning fee

## * How many guests it can accommodate

## * Price does not depend much on distance
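
Reading the strongest relationships off a large heatmap can be hard, so the correlations with price can also be listed as a ranked column. The sketch below runs on synthetic data built so that price follows capacity and ignores distance; the numbers are made up and do not reproduce the Berlin results.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
accommodates = rng.integers(1, 7, size=n)
distance_central = rng.uniform(0, 15, size=n)
# Synthetic price: driven by capacity plus noise, independent of distance
price = 20 + 12 * accommodates + rng.normal(0, 5, size=n)

toy = pd.DataFrame({
    "price": price,
    "accommodates": accommodates,
    "distance_central": distance_central,
})

# Correlation of every feature with price, strongest (by absolute value) first
price_corr = toy.corr()["price"].drop("price").sort_values(key=abs, ascending=False)
print(price_corr)
```

On the real data, the same one-liner applied to the `corr` matrix from the heatmap cell would give the ranking behind the bullet points above.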

# I also create a map of the different Airbnb listings in Berlin. To do that, I work with Folium, a Python visualization library developed solely for visualizing geospatial data. I use it to visualize the geographic details of Berlin and create a map of the city with the Airbnb listings superimposed on top, using their latitude and longitude values:

## Let's plot the top 1000 properties and check where they are concentrated on the map [central Berlin, railway station or airport]

In [None]:
# Setting a base map
lat = 52.509
long = 13.381
base = folium.Map(location=[lat,long], zoom_start=12) #base map setting
base

In [None]:
neighbourhoods = folium.map.FeatureGroup()

In [None]:
lat_long_list = [[52.520,13.405],[52.373,13.506],[52.507,13.332]] # locations of central Berlin, airport, railway station

In [None]:
for i in range(0,len(lat_long_list)):
    neighbourhoods.add_child(
        folium.CircleMarker(
        lat_long_list[i],
        radius = 16,
        color='yellow',
        fill=True,
        fill_color='red',
        fill_opacity=0.6
        )
    )
base.add_child(neighbourhoods)

In [None]:
# folium expects [latitude, longitude]
for lst_lat, lst_long in zip(df_top1000.latitude, df_top1000.longitude):
    neighbourhoods.add_child(
        folium.CircleMarker(
            [lst_lat, lst_long],
            radius = 6,
            color='red',
            fill=True,
            fill_color='yellow',
            fill_opacity=0.6
        )
    )
base.add_child(neighbourhoods)

## The map indicates that the top 1000 properties are concentrated around central Berlin and the railway station, with very few near the airport
## This is also evident from the distribution plots below, where properties are mostly close to central Berlin and the railway station

In [None]:
fig = plt.figure(figsize=(10,6))
ax0 = fig.add_subplot(2, 2, 1)
ax1 = fig.add_subplot(2, 2, 2)
ax2 = fig.add_subplot(2, 2, 3)

# histplot replaces the deprecated distplot(..., kde=False) and draws plain count histograms
sns.histplot(df_top1000["distance_central"], bins=10, ax=ax0)
ax0.set_title('Distances central berlin to apartments')
ax0.set_xlabel('distance_central')
ax0.set_ylabel('#properties')

sns.histplot(df_top1000["distance_railways"], bins=10, ax=ax1)
ax1.set_title('Distances railway station to apartments')
ax1.set_xlabel('distance_railways')
ax1.set_ylabel('#properties')

sns.histplot(df_top1000["distance_airport"], bins=10, ax=ax2)
ax2.set_title('Distances airport to apartments')
ax2.set_xlabel('distance_airport')
ax2.set_ylabel('#properties')

plt.subplots_adjust(top = 0.99, bottom=0.01, hspace=0.5, wspace=0.5)
plt.show()

## Let's extract the neighbourhood_cleansed, neighbourhood_group_cleansed, latitude, longitude and price columns

In [None]:
Berlin_data = df_s[['neighbourhood_cleansed','neighbourhood_group_cleansed', 'latitude', 'longitude','price']].reset_index(drop=True)
Berlin_data.head()

In [None]:
print('The dataframe has {} neighbourhood groups and {} neighbourhoods.'.format(
        Berlin_data['neighbourhood_group_cleansed'].nunique(),
        Berlin_data['neighbourhood_cleansed'].nunique()
    )
)


## Let's extract the data for the 5 most popular neighbourhood groups in Berlin

In [None]:
Belin_c = df_s[df_s.neighbourhood_group_cleansed.isin(['Mitte','Friedrichshain-Kreuzberg','Pankow','Neukölln','Charlottenburg-Wilm.'])]
    
Belin_c.head()    
    
    

In [None]:
Belin_c.shape

In [None]:
address =  'Berlin, Germany'

geolocator = Nominatim(user_agent="my-application", timeout=10)
Berlin_location = geolocator.geocode(address)

print('The geographical coordinates of Berlin are {}, {}.'.format(Berlin_location.latitude, Berlin_location.longitude))

In [None]:

Berlin_map=folium.Map(location=[Berlin_location.latitude,Berlin_location.longitude], zoom_start=12)
Berlin_map


# Let's explore the area around the center of Berlin

## Get the latitude and longitude values of Berlin.


In [None]:
address = 'Berlin' 

geolocator = Nominatim(user_agent="my-application", timeout=10)
Berlin_location = geolocator.geocode(address)

print('The geographical coordinates of Berlin are {}, {}.'.format(Berlin_location.latitude, Berlin_location.longitude))

## Define Foursquare Credentials and Version

In [None]:
CLIENT_ID = 'C50U5PW0KUQFAYG3VW3C3OTKWLKYAMDVDPEKKC3COOAML32M' # your Foursquare ID
CLIENT_SECRET = 'FQW0AQA0PF52RSL5ZQ3YSHMI2O4QQWYDGVTC5HJ2WFCTO4VI' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

## First, let's create the GET request URL

In [None]:
LIMIT = 100 # limit of number of venues returned by Foursquare API


radius = 500 # define radius

# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    Berlin_location.latitude, 
    Berlin_location.longitude, 
    radius, 
    LIMIT)
url # display URL

## Send the GET request and examine the results

In [None]:
results = requests.get(url).json()
results

In [None]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        # flattened JSON rows use the 'venue.categories' column name instead
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

## Now we are ready to clean the json and structure it into a pandas dataframe.

In [None]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

## And how many venues were returned by Foursquare?

In [None]:

print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))
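
To make the JSON handling above easier to follow without calling the API, here is a sketch that walks a hand-written payload with the same nesting (`response` → `groups` → `items` → `venue`) that the cells above read. The payload values and the helper name `extract_venues` are hypothetical.

```python
# Hand-written stand-in for an API response (illustrative values only)
sample_response = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Venue A",
                   "categories": [{"name": "Coffee Shop"}],
                   "location": {"lat": 52.5201, "lng": 13.4049}}},
        {"venue": {"name": "Venue B",
                   "categories": [],
                   "location": {"lat": 52.5190, "lng": 13.4060}}},
    ]}]}
}


def extract_venues(payload):
    """Flatten the nested venue items into a list of plain dictionaries."""
    rows = []
    for item in payload["response"]["groups"][0]["items"]:
        venue = item["venue"]
        categories = venue.get("categories", [])
        rows.append({
            "name": venue["name"],
            # a venue without categories gets None, as in get_category_type above
            "category": categories[0]["name"] if categories else None,
            "lat": venue["location"]["lat"],
            "lng": venue["location"]["lng"],
        })
    return rows


venues = extract_venues(sample_response)
print(venues)
```

A list of dictionaries like this can be passed straight to `pd.DataFrame` to get the same four columns as `nearby_venues`.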

## Define function to get venues around the neighborhoods

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request for this location (one API call per row)
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name'],
            v['venue']['id']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category',
                  'Venue id'         ]
    
    return(nearby_venues)

## Get venues data of the neighbourhood_cleansed

In [None]:
Berlin_venues = getNearbyVenues(names=Belin_c['neighbourhood_cleansed'],
                                   latitudes=Belin_c['latitude'],
                                   longitudes=Belin_c['longitude']
                                  )
           

In [None]:
print(Berlin_venues.shape)
Berlin_venues.head()

## Let's check how many venues were returned for each neighborhood

In [None]:
Berlin_venues.groupby('Neighborhood').count().head()

## Let's find out how many unique categories can be curated from all the returned venues

In [None]:
print('There are {} unique categories.'.format(len(Berlin_venues['Venue Category'].unique())))


# Analyze Each Neighborhood

### Encode the Neighborhood

In [None]:
# one hot encoding of the venue category only
Berlin_onehot = pd.get_dummies(Berlin_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
Berlin_onehot['Neighborhood'] = Berlin_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [Berlin_onehot.columns[-1]] + list(Berlin_onehot.columns[:-1])
Berlin_onehot = Berlin_onehot[fixed_columns]

Berlin_onehot.head()

## And let's examine the new dataframe size

In [None]:
Berlin_onehot.shape

## Next, let's group rows by neighborhood, taking the mean of the frequency of occurrence of each category

In [None]:
Berlin_grouped = Berlin_onehot.groupby('Neighborhood').mean().reset_index()
Berlin_grouped.head()

## Let's confirm the new size

In [None]:
Berlin_grouped.shape
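
The encode-then-average step is easier to see on a tiny table. In the sketch below, two hypothetical neighborhoods "A" and "B" have a handful of venues each; the column names match the ones used above, the values are made up.

```python
import pandas as pd

# Hypothetical venues for two neighborhoods
toy_venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Cafe", "Cafe", "Bar", "Hotel", "Hotel", "Hotel"],
})

# One column per category, 1.0 where the venue belongs to it
onehot = pd.get_dummies(toy_venues[["Venue Category"]], prefix="", prefix_sep="", dtype=float)
onehot.insert(0, "Neighborhood", toy_venues["Neighborhood"])

# The mean of the indicator columns is the share of each category per neighborhood
grouped = onehot.groupby("Neighborhood").mean().reset_index()
print(grouped)
```

Each row of `grouped` sums to 1, so the values can be read as category frequencies: half of the venues in "A" are cafes, and every venue in "B" is a hotel.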

## Let's print each neighborhood along with the top 5 most common venues

In [None]:
num_top_venues = 5

for hood in Berlin_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = Berlin_grouped[Berlin_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

## Let's put that into a pandas dataframe
First, let's write a function to sort the venues in descending order.

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

## Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = Berlin_grouped['Neighborhood']

for ind in np.arange(Berlin_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(Berlin_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

## Let's count the number of venues for each venue category in the neighborhoods
## top 5 most common venues

In [None]:
df_no_top_venue = Berlin_venues[['Neighborhood','Venue Category','Venue']].groupby(['Neighborhood','Venue Category']).count()
df_no_top_venue.head()

# Cluster Neighborhoods
## Run k-means to cluster the neighborhoods into 3 clusters.

In [None]:
# set number of clusters
kclusters = 3
# keep only the venue frequency columns as clustering features
Berliln_grouped_clustering = Berlin_grouped.drop(columns='Neighborhood')
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(Berliln_grouped_clustering)
# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

In [None]:
Berliln_grouped_clustering.head()

In [None]:
kmeans.labels_
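
As a small illustration of what the labels mean, the sketch below runs k-means on six hypothetical neighborhoods described by three venue frequencies. The data is invented so that the grouping is obvious; `n_init=10` is set explicitly here, which is an assumption and not a setting taken from the cell above.

```python
import numpy as np
from sklearn.cluster import KMeans

# Rows are hypothetical neighborhoods, columns are [cafe, bar, hotel] frequencies
X = np.array([
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.1, 0.8],
    [0.1, 0.2, 0.7],
])

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = model.labels_
print(labels)
```

The label numbers themselves are arbitrary: what matters is which rows share a label, so cluster names should be assigned only after inspecting the members of each cluster.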

In [None]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

Berlin_merged = df_s

# join the cluster labels and top venues onto each listing via its neighbourhood
Berlin_merged = Berlin_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='neighbourhood_cleansed')
Berlin_merged.head() # check the last columns!

## Finally, let's visualize the resulting clusters

In [None]:
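latitude=52.5170365
longitude=13.3888599
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters (one color per cluster)
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# listings outside the clustered neighbourhoods have no cluster label, so skip them
clustered = Berlin_merged.dropna(subset=['Cluster Labels'])

# add markers to the map
for lat, lon, poi, cluster in zip(clustered['latitude'], clustered['longitude'], clustered['neighbourhood_cleansed'], clustered['Cluster Labels']):
    cluster = int(cluster)
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    # draw one circle per listing, colored by cluster (marker size and opacity are assumed values)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters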

# Examine Clusters
## I will examine each cluster and determine the discriminating venue categories that distinguish it. Based on the defining categories, I then assign a name to each cluster.

## Cluster 1

In [None]:
# all columns for the listings whose neighbourhood falls in the first cluster (label 0)
cluster1 = Berlin_merged.loc[Berlin_merged['Cluster Labels'] == 0]
print(cluster1.shape[0], "neighborhood(s) in Cluster 1")
cluster1.head()

## Cluster 2

In [None]:
# all columns for the listings whose neighbourhood falls in the second cluster (label 1)
cluster2 = Berlin_merged.loc[Berlin_merged['Cluster Labels'] == 1]
print(cluster2.shape[0], "neighborhood(s) in Cluster 2")
cluster2.head()

## Cluster 3

In [None]:
# all columns for the listings whose neighbourhood falls in the third cluster (label 2)
cluster3 = Berlin_merged.loc[Berlin_merged['Cluster Labels'] == 2]
print(cluster3.shape[0], "neighborhood(s) in Cluster 3")
cluster3.head()

# Name the clusters
## 1. Cluster 1 could be "Restaurant & Bar"

## 2. Cluster 2 could be "Lots of Hotels"

## 3. Cluster 3 could be "Diverse Entertainment"



# Discussion and Recommendations
## k-means partitioned the Airbnb listings into 3 groups, since we told the algorithm to generate 3 clusters. The Airbnb listings in each cluster are similar to each other in terms of the features included in the dataset.

## I check the centroid values by averaging the features, and get the most common venues in each cluster.
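
The centroid check described above can be sketched as follows: average the feature columns within each cluster, then rank the categories per cluster. The table is a hypothetical miniature with three venue categories; only the column name `Cluster Labels` comes from this notebook.

```python
import pandas as pd

# Hypothetical venue frequencies for four neighborhoods in two clusters
toy = pd.DataFrame({
    "Cluster Labels": [0, 0, 1, 1],
    "Cafe":  [0.6, 0.4, 0.1, 0.1],
    "Bar":   [0.3, 0.5, 0.1, 0.3],
    "Hotel": [0.1, 0.1, 0.8, 0.6],
})

# Averaging the members of a cluster gives its centroid
centroids = toy.groupby("Cluster Labels").mean()

# The two highest-weighted categories per centroid
top2 = {
    cluster: row.sort_values(ascending=False).index[:2].tolist()
    for cluster, row in centroids.iterrows()
}
print(top2)
```

The top categories of each centroid are a reasonable starting point for the kind of cluster names proposed earlier.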



# Conclusion

## I have combined Airbnb listings and Foursquare data to provide useful information to travelers in Berlin about the location and the most common venues they can visit in an area of 500 meters around their accommodation.

## I have grouped the Airbnb listings in Berlin into 3 clusters based on similar venues and price levels. Travelers could leverage the clusters to filter listings according to their price preferences and the most common venues. In other words, travelers could search Airbnb listings according to location or venues they would like to visit, close to their accommodation.

The project is available on GitHub [5]

# This is the end of project