# Capstone Project


# Finding best locations for an advertisement campaign in Istanbul, Turkey

## The Description of the Problem
**ABC Limited** is a tour company in Istanbul, Turkey. It wants to start an advertising campaign for its tours, targeting tourists in Istanbul who are traveling without a pre-arranged tour program.

Its budget covers advertisements in only a limited number of areas. We want to analyze spatial data for Istanbul to infer which neighborhoods are most populated by tourists, and which of them contain tourists who are willing to pay for tours.

## The Description of the Data and how it will be used to solve the problem
We will use data from the “Inside Airbnb” site, which is sourced from publicly available information on the Airbnb site.   
http://insideairbnb.com/get-the-data.html  

The detailed listings data for Istanbul is available from the “Inside Airbnb” site at:  
http://data.insideairbnb.com/turkey/marmara/istanbul/2018-11-21/data/listings.csv.gz  

Using Airbnb listing data rather than hotel listing data has an advantage: most hotel clients come to Istanbul in groups with pre-arranged tour programs, whereas Airbnb clients usually arrange their own travel, accommodation, and tour programs.
We want to choose the five most prominent zipcodes in the neighborhood most populated by tourists, based on the distribution of Airbnb properties and their per-night prices.

### Importing the required libraries

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim 

!pip install folium
import folium # map rendering library

print('Libraries imported.')

Collecting package metadata: ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Libraries imported.


### Importing the data

In [2]:
df = pd.read_csv('http://data.insideairbnb.com/turkey/marmara/istanbul/2018-11-21/data/listings.csv.gz', compression='gzip', 
                 index_col=0, low_memory=False)

### Exploring the data

In [3]:
df.shape

(14927, 95)

In [4]:
df.columns

Index(['listing_url', 'scrape_id', 'last_scraped', 'name', 'summary', 'space',
       'description', 'experiences_offered', 'neighborhood_overview', 'notes',
       'transit', 'access', 'interaction', 'house_rules', 'thumbnail_url',
       'medium_url', 'picture_url', 'xl_picture_url', 'host_id', 'host_url',
       'host_name', 'host_since', 'host_location', 'host_about',
       'host_response_time', 'host_response_rate', 'host_acceptance_rate',
       'host_is_superhost', 'host_thumbnail_url', 'host_picture_url',
       'host_neighbourhood', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified', 'street',
       'neighbourhood', 'neighbourhood_cleansed',
       'neighbourhood_group_cleansed', 'city', 'state', 'zipcode', 'market',
       'smart_location', 'country_code', 'country', 'latitude', 'longitude',
       'is_location_exact', 'property_type', 'room_type', 'accommodates',
       'bathrooms', 'bedr

### Subsetting the relevant columns

In [5]:
df1 = df[['name', 'neighbourhood', 'zipcode', 'latitude', 'longitude', 'accommodates', 'price']]

In [6]:
df1.head()

id,name,neighbourhood,zipcode,latitude,longitude,accommodates,price
4826,The Place,Üsküdar,34684,41.056499,29.053674,2,$562.00
20815,The Bosphorus from The Comfy Hill,Beşiktaş,34345,41.069842,29.045452,3,$102.00
25436,House for vacation rental furnutare,Beşiktaş,34400,41.077312,29.038906,3,$214.00
27271,LOVELY APT. IN PERFECT LOCATION,Cihangir,34433,41.032195,28.982163,2,$182.00
28277,Duplex Apartment with Terrace,Şişli,34373,41.044708,28.985674,4,$605.00


### Exploring the data (cont.) 

In [7]:
print('The dataframe has {} neighbourhoods.'.format(
        len(df1['neighbourhood'].unique())
    )
)

The dataframe has 16 neighbourhoods.


In [8]:
print(df1['neighbourhood'].unique())

['Üsküdar' 'Beşiktaş' 'Cihangir' 'Şişli' 'Beyoglu' 'Taksim' nan 'Karaköy'
 'Kadıköy' 'Eminönü' 'Sultanahmet' 'Moda' 'Kadıköy Merkezi' 'Fatih'
 'Ortaköy' 'Aksaray']
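
The list above contains `nan`, so the count of 16 includes missing values as if they were a neighbourhood. A minimal sketch of the difference between the two counting approaches, using a small made-up Series (the values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for df1['neighbourhood'].
neighbourhoods = pd.Series(['Şişli', 'Taksim', np.nan, 'Şişli', 'Fatih', np.nan])

# unique() keeps NaN as one of the distinct values.
count_with_nan = len(neighbourhoods.unique())

# nunique() excludes NaN by default (dropna=True).
count_named = neighbourhoods.nunique()

print(count_with_nan, count_named)
```

Which count is appropriate depends on whether listings without a neighbourhood label should be treated as their own group.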


In [9]:
df1.shape

(14927, 7)

### Analyzing the data

Now we explore the neighbourhoods of Istanbul to determine which area has the largest number of property listings.

In [10]:
neighbourhood_tab = pd.crosstab(index=df1["neighbourhood"], columns="count").sort_values('count', ascending=False)
neighbourhood_tab

neighbourhood,count
Şişli,2010
Taksim,1717
Sultanahmet,1335
Beşiktaş,1217
Cihangir,1107
Karaköy,874
Kadıköy,827
Üsküdar,574
Moda,503
Fatih,452


We see that "Şişli" has the largest number of property listings in the dataset.
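
The same ranking can also be obtained with `value_counts`, which sorts in descending order and skips missing values by default. The sketch below uses a small made-up frame (the name `listings` and its rows are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical toy data standing in for df1.
listings = pd.DataFrame(
    {'neighbourhood': ['Şişli', 'Taksim', 'Şişli', None, 'Fatih', 'Şişli', 'Taksim']}
)

# Count listings per neighbourhood, most frequent first; None/NaN rows are dropped.
counts = listings['neighbourhood'].value_counts()

# Label of the neighbourhood with the most listings.
top_neighbourhood = counts.idxmax()

print(counts)
print(top_neighbourhood)
```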

### Subsetting the neighborhood most populated by tourists

In [11]:
df2 = df1[df1['neighbourhood'] == 'Şişli']

### Exploring the subsetted dataset

In [12]:
df2.shape

(2010, 7)

In [13]:
df2.head()

id,name,neighbourhood,zipcode,latitude,longitude,accommodates,price
28277,Duplex Apartment with Terrace,Şişli,34373.0,41.044708,28.985674,4,$605.00
33368,Deluxe double bedroom @ Nisantasi,Şişli,34365.0,41.053821,28.997393,2,$305.00
87024,Nisantasi Studio Apartment,Şişli,,41.050176,28.990152,2,"$1,526.00"
146854,Beautiful & Super Deluxe Flat,Şişli,34400.0,41.04753,28.980305,1,$321.00
175766,at the center of İstanbul..,Şişli,34387.0,41.070053,28.985606,2,$161.00


### Cleaning the dataset

We want to prepare the dataset so that we can analyze its values.

First let us explore the datatypes of its columns.

In [14]:
df2.dtypes

name              object
neighbourhood     object
zipcode           object
latitude         float64
longitude        float64
accommodates       int64
price             object
dtype: object

We want to clean the "price" column by removing the dollar sign and the thousands separator, then convert it to float values.

In [15]:
df2['price'].head()

id
28277       $605.00
33368       $305.00
87024     $1,526.00
146854      $321.00
175766      $161.00
Name: price, dtype: object

In [16]:
# strip '$' and ',' with a regular expression, then convert to float
df2['price'] = df2['price'].str.replace(r'[$,]', '', regex=True).astype(float)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


After using a regular expression to do the transformation, we see that the column now has the form we wanted.

In [17]:
df2['price'].head()

id
28277      605.0
33368      305.0
87024     1526.0
146854     321.0
175766     161.0
Name: price, dtype: float64
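
The warning printed when the price column was assigned arises because the subset was created by filtering another DataFrame. One common way to avoid it is to take an explicit copy of the subset before modifying it. The sketch below illustrates this on made-up rows (the names `listings` and `sisli` are hypothetical, and this is not necessarily how the original notebook should be restructured):

```python
import pandas as pd

# Hypothetical toy data standing in for df1.
listings = pd.DataFrame(
    {'neighbourhood': ['Şişli', 'Taksim', 'Şişli'],
     'price': ['$605.00', '$182.00', '$1,526.00']},
    index=[28277, 27271, 87024],
)

# .copy() makes the subset an independent DataFrame,
# so assigning to one of its columns is unambiguous.
sisli = listings[listings['neighbourhood'] == 'Şişli'].copy()

# Remove '$' and ',' and convert the remaining text to float.
sisli['price'] = sisli['price'].str.replace(r'[$,]', '', regex=True).astype(float)

print(sisli)
```

The original frame is left untouched; only the copy holds the numeric prices.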

We can also check the data types of the dataset again.

In [18]:
df2.dtypes

name              object
neighbourhood     object
zipcode           object
latitude         float64
longitude        float64
accommodates       int64
price            float64
dtype: object

The subset has 2010 rows and 7 columns, as we have seen above.   

We group the data by zipcode to reduce the number of points shown on the map, since we are interested in areas rather than in individual properties.

In [19]:
# zipcode is still a text column here, so only the numeric columns are averaged
df3 = df2.groupby('zipcode')[['latitude', 'longitude', 'accommodates', 'price']].mean()

In [20]:
df3.shape

(71, 4)

Now we have a more convenient data frame to work with, as it contains 71 zipcode areas.

Let us explore the data frame.

In [21]:
df3

zipcode,latitude,longitude,accommodates,price
30433,41.05503,28.979648,4.0,252.0
34000,41.063441,28.990359,2.444444,207.111111
340000,41.049608,28.979862,1.0,102.0
34010,41.053961,28.972479,5.0,643.0
34040,41.055292,28.997757,2.0,305.0
34100,41.065292,28.988944,2.285714,153.142857
34138,41.060833,28.98974,2.923077,816.230769
34200,41.057497,28.990658,2.38806,231.507463
34212,41.074091,28.999491,1.0,134.0
34240,41.060336,28.98515,2.4,242.0


We notice that two zipcodes were entered incorrectly: 340000 and 3434.  

Zipcodes in Turkey all have 5 digits, so we fix them by removing a trailing zero and appending a zero, respectively.
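
The five-digit rule can also be checked programmatically instead of by eye. The following is a minimal sketch on a made-up Series; it assumes the zipcodes are stored as plain digit strings, which may not hold for every row of the real column:

```python
import pandas as pd

# Hypothetical toy values standing in for df2['zipcode'].
zipcodes = pd.Series(['34373', '340000', '3434', None, '34365'])

# Ignore missing values and compare everything else as text.
as_text = zipcodes.dropna().astype(str)

# Keep the entries that are NOT exactly five digits.
malformed = as_text[~as_text.str.match(r'^\d{5}$')]

print(malformed.tolist())
```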

Let us find out the data type of the 'zipcode' column. 

In [22]:
df2['zipcode'].dtypes

dtype('O')

Let us first replace all NA values in the 'zipcode' column with 34000, which is the main zip code for Istanbul. We also change the data type of the 'zipcode' column to int32.

In [23]:
df2['zipcode'] = df2.zipcode.fillna(34000).astype('int32')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


We check the 'zipcode' column data type. It is int32 now.

In [24]:
df2['zipcode'].dtypes

dtype('int32')

We locate the 3434 zipcode and fix it by appending a 0.

In [25]:
df2.loc[df2['zipcode'] == 3434]

id,name,neighbourhood,zipcode,latitude,longitude,accommodates,price
6987797,"Very central location, sightly,",Şişli,3434,41.074386,28.992794,2,246.0


In [26]:
df2.loc[6987797, 'zipcode'] 

3434

In [27]:
df2.loc[6987797, 'zipcode'] = 34340

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


In [28]:
df2.loc[6987797, 'zipcode'] 

34340

We locate the 340000 zipcode and fix it by removing the trailing 0.

In [29]:
df2.loc[df2['zipcode'] == 340000]

id,name,neighbourhood,zipcode,latitude,longitude,accommodates,price
9970104,Kiralık oda,Şişli,340000,41.049608,28.979862,1,102.0


In [30]:
df2.loc[9970104, 'zipcode'] 

340000

In [31]:
df2.loc[9970104, 'zipcode'] = 34000

In [32]:
df2.loc[9970104, 'zipcode'] 

34000
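
The two corrections above can also be written as a single mapping from wrong to corrected values, which scales better if more bad entries turn up. A minimal sketch on made-up data (the name `zip_fixes` is hypothetical; the two corrections are the ones described in the text):

```python
import pandas as pd

# Wrong value -> corrected value, as described in the text.
zip_fixes = {3434: 34340, 340000: 34000}

# Hypothetical toy column standing in for df2['zipcode'], indexed by listing id.
zipcodes = pd.Series([34373, 3434, 340000, 34000],
                     index=[28277, 6987797, 9970104, 87024])

# replace() swaps every value that appears as a key in the mapping.
fixed = zipcodes.replace(zip_fixes)

print(fixed)
```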

Let us explore zipcodes now.

In [33]:
pd.crosstab(index=df2["zipcode"], columns="count")

zipcode,count
30433,1
34000,251
34010,1
34040,1
34100,7
34138,13
34200,67
34212,1
34240,5
34250,94


The malformed entries are gone, so we can now work on the main problem.

## Exploring and analyzing the dataframe to find the best locations

We group the data frame by zipcode again and explore the result, sorting the grouped data by price in descending order.

In [34]:
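# select the columns with a list (double brackets); zipcode is numeric now, so it is kept as a column too
df3 = df2.groupby('zipcode')[['zipcode', 'latitude', 'longitude', 'accommodates', 'price']].mean()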

In [35]:
df3.shape

(69, 5)

In [36]:
df4 = df3.sort_values('price', ascending=False)
df4

zipcode,zipcode,latitude,longitude,accommodates,price
34510,34510,41.055856,28.984007,5.0,5999.0
36360,36360,41.053761,28.992516,3.0,1039.0
34580,34580,41.057613,28.990124,5.0,916.0
34834,34834,41.079266,28.989058,5.0,830.0
34138,34138,41.060833,28.98974,2.923077,816.230769
34357,34357,41.048746,28.996778,2.5,807.5
34303,34303,41.046403,28.985774,4.0,670.0
34010,34010,41.053961,28.972479,5.0,643.0
34371,34371,41.052596,28.989003,4.304348,634.565217
34373,34373,41.04612,28.985468,3.653595,621.555556


The result looks usable, so we take the five most expensive zipcodes according to the price column.

In [37]:
max_zipcodes = df4[['zipcode']][0:5]
max_zipcodes = max_zipcodes['zipcode'].tolist()
max_zipcodes

[34510, 36360, 34580, 34834, 34138]

We find the targeted zipcodes:  
[34510, 36360, 34580, 34834, 34138]
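
The earlier count table shows that some zipcodes contain a single listing, so an average price can rest on one property. As an optional robustness check, and not part of the analysis above, the ranking could be restricted to zipcodes with a minimum number of listings. The sketch below uses made-up rows, and the threshold `min_listings` is a hypothetical choice:

```python
import pandas as pd

# Hypothetical toy data standing in for df2.
listings = pd.DataFrame({
    'zipcode': [34510, 34138, 34138, 34373, 34373, 34373],
    'price': [5999.0, 800.0, 832.0, 600.0, 620.0, 640.0],
})

# Mean price and number of listings per zipcode.
summary = listings.groupby('zipcode')['price'].agg(['mean', 'count'])

# Hypothetical threshold: ignore zipcodes with fewer listings than this.
min_listings = 2
robust = summary[summary['count'] >= min_listings]

# Zipcodes with the highest mean price among the remaining ones.
top = robust.nlargest(2, 'mean').index.tolist()

print(summary)
print(top)
```

With such a filter, a single very expensive listing no longer decides the ranking on its own.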

## Visualizing the data and the results

We use the 'geopy' library to get the latitude and longitude values of 'Şişli, Istanbul'.

In [38]:
address = 'Şişli, Istanbul'

geolocator = Nominatim(user_agent="coursera-capstone-project")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Şişli, Istanbul are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Şişli, Istanbul are 41.061672, 28.9842605962855.


Next, we create a map of Şişli, Istanbul, which is the neighborhood most populated by tourists in Istanbul according to the data frame. We add a marker at the average location of each zipcode in Şişli. The five most expensive zipcodes by price per night are colored red; the others are colored blue.

In [39]:
# create map of Istanbul using latitude and longitude values
map_Istanbul = folium.Map(location=[latitude, longitude], zoom_start=13)

# add markers to map
for lat, lng, zipcode in zip(df3['latitude'], df3['longitude'], df3['zipcode']):
    label = '{}'.format(zipcode)
    label = folium.Popup(label, parse_html=True)
    if zipcode in max_zipcodes: 
        color_o = 'red' 
    else: 
        color_o = 'blue'
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=color_o,
        fill=True,
        fill_color=color_o,
        fill_opacity=0.7).add_to(map_Istanbul)  
    
map_Istanbul
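
The colouring rule used in the loop above can be isolated into a small helper, which makes it easy to check without rendering a map. This is only an illustrative sketch; the function name `marker_color` is hypothetical and the sample zipcodes are taken from the tables above:

```python
def marker_color(zipcode, highlighted):
    """Return 'red' for a highlighted zipcode and 'blue' otherwise."""
    return 'red' if zipcode in highlighted else 'blue'


# The five targeted zipcodes found earlier.
max_zipcodes = [34510, 36360, 34580, 34834, 34138]

# Colour a few sample zipcodes; 34373 is not among the targeted ones.
colors = {z: marker_color(z, max_zipcodes) for z in [34510, 34373, 34138]}

print(colors)
```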

Now the tour company can start the advertising campaign for its tours in the areas marked in red.