# Preprocessing Venues
In this notebook, we find the location of every venue and add this information to the events.

We use the Google Places API.


In [None]:
import pandas as pd
import os
from Scripts import VenueLocationHelper
import numpy as np

This list of events comes from the previous preprocessing steps. See the notebooks preprocessing Artist and preprocessing Venue.

In [None]:
total_events = pd.read_csv(os.path.join('Events', 'total_events_preprocessed.csv'))


## Clean cities' names

First, we clean the city names. For example, Geneva may appear as Genf or geneve. We also strip any whitespace that is still hiding there, and convert every city name to lower case with a capitalized first letter.

In [None]:
total_events = VenueLocationHelper.cleanCity(total_events)
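
The implementation of `cleanCity` is not shown in this notebook. As a rough illustration of the cleaning described above, here is a minimal sketch. The alias table `CITY_ALIASES` and the function `clean_city_names` are hypothetical names introduced for this example, not part of `VenueLocationHelper`.

```python
import pandas as pd

# Hypothetical alias table: lower-cased spellings mapped to one canonical name.
CITY_ALIASES = {"genf": "Geneva", "geneve": "Geneva", "genève": "Geneva"}


def clean_city_names(events: pd.DataFrame, column: str = "City") -> pd.DataFrame:
    """Return a copy with trimmed, capitalized and de-aliased city names."""
    cleaned = events.copy()
    # Trim the ends and collapse inner runs of whitespace; missing values stay missing.
    names = cleaned[column].str.strip().str.replace(r"\s+", " ", regex=True)
    lowered = names.str.lower()
    # Known aliases get the canonical name; everything else is lower case
    # with a capital first letter.
    cleaned[column] = lowered.map(CITY_ALIASES).fillna(names.str.capitalize())
    return cleaned


sample = pd.DataFrame({"City": [" genf", "GENEVE ", "zurich", None]})
result = clean_city_names(sample)
print(result["City"].tolist())
```

Keeping the aliases in a plain dictionary makes it easy to add new spellings as they turn up in the data.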

## Clean venues' names
First, accents, special characters and capital letters are removed from the venue names.

Then, we group venues that are the same but whose names vary slightly. For example, *Mica club (with DJa)* and *Mica club* are the same venue.
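
The name normalization can be sketched with the standard library alone. This is an assumption about what `cleanVenue` does, not its actual code: `normalize_venue_name` is a hypothetical helper, and dropping parenthesized notes is just one possible way to make the two *Mica club* variants collapse to the same key.

```python
import re
import unicodedata


def normalize_venue_name(name: str) -> str:
    """Strip accents, notes in parentheses and special characters from a name."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Assumption: parenthesized notes such as "(with DJa)" are not part of the name.
    without_notes = re.sub(r"\([^)]*\)", " ", without_accents)
    # Remove everything that is not a letter, a digit or a space.
    alnum_only = re.sub(r"[^0-9A-Za-z ]+", "", without_notes)
    # Collapse whitespace, then lower-case everything except the first letter.
    return re.sub(r"\s+", " ", alnum_only).strip().capitalize()


print(normalize_venue_name("Mica club (with DJa)"))
print(normalize_venue_name("Café de l'Été"))
```

Venues whose normalized names are equal can then be grouped with an ordinary `groupby` on that key.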

When we group these records under the same name, we try not to lose any information. If some of the records covering the same place already have an address, latitude or longitude set, we keep those values, but only if they are present in more than 2/3 of the records in this list of similar venues, because an individual record can always carry wrong coordinates. If the coordinates are not specified, this is no problem: Google Places will take care of it.


In [4]:
total_events = VenueLocationHelper.cleanVenue(total_events)
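
The 2/3 rule described above can be read in more than one way. The sketch below assumes one reading: within a group of similar venues, the most frequent non-null value is kept only if it appears in more than 2/3 of the rows. `consensus_value` is a hypothetical helper written for this illustration, and the coordinates are made-up sample values.

```python
import numpy as np
import pandas as pd


def consensus_value(values: pd.Series, threshold: float = 2 / 3):
    """Return the most frequent non-null value if it covers more than
    `threshold` of all rows in the group, otherwise NaN."""
    counts = values.dropna().value_counts()
    if counts.empty:
        return np.nan
    # value_counts sorts by frequency, so the first entry is the most common one.
    top_value, top_count = counts.index[0], counts.iloc[0]
    return top_value if top_count / len(values) > threshold else np.nan


group = pd.DataFrame({
    "Venue": ["Mica club"] * 4,
    "Latitude": [46.52, 46.52, 46.52, np.nan],
    "Longitude": [6.63, np.nan, 6.63, np.nan],
})
latitude = consensus_value(group["Latitude"])    # 3 of 4 rows agree: kept
longitude = consensus_value(group["Longitude"])  # only 2 of 4 rows agree: dropped
print(latitude, longitude)
```

Values that fail the threshold are left as NaN, so that Google Places fills them in later.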

## Extract unique venues from events

Now, we want to extract a list of all the unique venues in Switzerland. The cleaning above reduced the list by 10%. As we will see in the final results, this preprocessing is not perfect yet, but it already allows for nice insights.

In [22]:
# Keep only unique venues and their parameters
# (.copy() so that the in-place edits below do not act on a slice of total_events)
total_venues = total_events.drop_duplicates(subset='Venue').copy()
total_venues.drop(['Artist','Date','genre','origin'], axis = 1, inplace = True)
total_venues.reset_index(inplace = True, drop = True)
print('Total number of unique venues in Switzerland :', len(total_venues))
total_venues_na = total_venues[total_venues['Latitude'].isnull()]
print('Total number of unique venues for which we don\'t have the coordinates :', len(total_venues_na))


# Manual corrections for venues that caused problems when parsing with Google
total_venues.loc[total_venues['City'].str.contains('Ken'), 'City'] = 'Luzern'
total_venues.loc[total_venues['City'].str.contains('Rothis'), 'Adress'] = 'Rothis'
total_venues.loc[total_venues['Venue'].str.contains('Cry der – club daltitude'), 'Adress'] = 'Crans'

total_venues.loc[total_venues['Venue'].str.contains('Cry der – club daltitude'), 'Venue'] = 'Cry der club daltitude'
total_venues.loc[total_venues['Venue'].str.contains('Dancing schonbrunnen'), 'Adress'] = 'Munchenbuchsee'
total_venues.loc[total_venues['Venue'].str.contains('Dancing schonbrunnen'), 'City'] = 'Munchenbuchsee'

total_venues.loc[total_venues['Venue'].str.contains('Mir'), 'Adress'] = 'Oslostrasse 12, Dreispitz'
total_venues.loc[total_venues['Venue'].str.contains('Planet e'), 'Adress'] = 'Ohmweg 10'
total_venues.loc[total_venues['Venue'].str.contains('Provi buerglen'), 'Adress'] = 'Industriestrasse'
total_venues.loc[total_venues['Venue'].str.contains('Tresor club sihlbrugg'), 'Adress'] = 'Industrie Sihlbrugg'
total_venues.loc[total_venues['Venue'].str.contains('Villa foresta'), 'Adress'] = 'Via Villa Foresta'
total_venues.loc[total_venues['Venue'].str.contains('Memphis disco pub'), 'Adress'] = np.nan

Total number of unique venues in Switzerland : 22535
Total number of unique venues for which we don't have the coordinates : 2212


## Extract Latitude and Longitude for all venues using Google Places
To cope with API restrictions, this is done in three batches, using three different IP addresses.

In [23]:
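# Read the key from the environment instead of hard-coding it in the notebook
# (GOOGLE_PLACES_API_KEY is an assumed variable name)
api_key = os.environ['GOOGLE_PLACES_API_KEY']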
total_venues1 = total_venues[:8500]
total_venues2 = total_venues[8500:15000]
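# NOTE: this slice starts at row 22528, so rows 15000 to 22527 are not covered by the
# three chunks as written; a chunk covering the whole remainder would be total_venues[15000:]
total_venues3 = total_venues[22528:]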
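
When splitting by hand, it is easy to leave rows out between two chunks. A minimal sketch of a split that always covers every row exactly once is given below. `split_into_chunks` is a hypothetical helper, and the toy frame stands in for `total_venues`; with the real data the boundaries would be `[8500, 15000]`.

```python
import pandas as pd


def split_into_chunks(frame: pd.DataFrame, boundaries: list[int]) -> list[pd.DataFrame]:
    """Split `frame` at the given row positions into contiguous chunks."""
    edges = [0, *boundaries, len(frame)]
    # Consecutive pairs of edges give half-open ranges that cannot overlap or leave gaps.
    return [frame.iloc[start:stop] for start, stop in zip(edges[:-1], edges[1:])]


venues = pd.DataFrame({"Venue": [f"venue {i}" for i in range(10)]})
chunks = split_into_chunks(venues, boundaries=[4, 7])

# Every row must land in exactly one chunk.
assert sum(len(chunk) for chunk in chunks) == len(venues)
print([len(chunk) for chunk in chunks])
```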


In [None]:
not_found_cter = VenueLocationHelper.getDataGooglePlace(total_venues1,api_key, 1)
percentage_not_found_cter = round(100*not_found_cter/len(total_venues1),2)
print('\n Percentage of data not found : ', percentage_not_found_cter)

In [None]:
not_found_cter = VenueLocationHelper.getDataGooglePlace(total_venues2,api_key, 2)
percentage_not_found_cter = round(100*not_found_cter/len(total_venues2),2)
print('\n Percentage of data not found : ', percentage_not_found_cter)

In [25]:
not_found_cter = VenueLocationHelper.getDataGooglePlace(total_venues3,api_key, 3)
percentage_not_found_cter = round(100*not_found_cter/len(total_venues3),2)
print('\n Percentage of data not found : ', percentage_not_found_cter)

22528 / 22534
To find : Videoex  in  Kanonengasse 20, 8004 Zurich, Zürich, Switzerland
Found : Kunstraum Walcheturm
22529 / 22534
To find : Villa foresta  in  Via Villa Foresta, Pietro, Switzerland
Found : Conwatec S.a g.l.
22530 / 22534
To find : Villa underground  in  Auf dem Wolf 4, 4053 Basel, Basel, Switzerland
Found : Villa Wenkenhof
22531 / 22534
To find : Viscose eventbar  in  Emmenweidstrasse 20, 6020, Emmenbrücke, Switzerland
Found : VISCOSE Bar Lounge Event
22532 / 22534
To find : Xellent club  in  Rue Centrale 17, 3963 Crans-Montana, Crans-montana, Switzerland
Found : Crans-Montana
22533 / 22534
To find : Zapoff  in  Rue de la Vigie 5, 1003 Lausanne, Lausanne, Switzerland
Found : U Bar
22534 / 22534
To find : Zenka  in  Rue de Genève 10, 1003 Lausanne, Lausanne, Switzerland
('Not found',)
Finally, Found : themata

 Percentage of data not found :  0.0
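
The code of `getDataGooglePlace` is not part of this notebook. To give an idea of what one lookup may involve, here is a minimal sketch based on the Places Text Search web service. The helper names (`build_query`, `parse_first_result`, `search_place`) are hypothetical, the query layout is only inferred from the log above, and the payload at the bottom is hand-written sample data, not a real API answer.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

TEXT_SEARCH_URL = "https://maps.googleapis.com/maps/api/place/textsearch/json"


def build_query(venue, address, city) -> str:
    """Join the available fields into one free-text query, skipping missing ones."""
    parts = [venue, address, city, "Switzerland"]
    return ", ".join(p for p in parts if isinstance(p, str) and p.strip())


def parse_first_result(payload: dict):
    """Return (name, latitude, longitude) of the first hit, or None if nothing was found."""
    results = payload.get("results", [])
    if payload.get("status") != "OK" or not results:
        return None
    first = results[0]
    location = first["geometry"]["location"]
    return first["name"], location["lat"], location["lng"]


def search_place(query: str, api_key: str):
    """Send one Text Search request (needs network access and a valid key)."""
    url = f"{TEXT_SEARCH_URL}?{urlencode({'query': query, 'key': api_key})}"
    with urlopen(url, timeout=10) as response:
        return parse_first_result(json.load(response))


# Offline example with made-up values in the shape of a Text Search response.
sample_payload = {
    "status": "OK",
    "results": [{"name": "Example venue",
                 "geometry": {"location": {"lat": 47.0, "lng": 8.0}}}],
}
print(build_query("Zenka", "Rue de Genève 10, 1003 Lausanne", "Lausanne"))
print(parse_first_result(sample_payload))
```

Separating the parsing from the network call makes it possible to test the parsing without spending any API quota.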


### Concatenating the above results

In [26]:
total_venues3 = pd.read_csv(os.path.join('./Scripts/GooglePlaceData/total_venue_GooglePlace3.csv'))
# Drop rows that are repeated header lines (their 'Adress' cell is the literal string 'Adress')
total_venues3 = total_venues3[total_venues3['Adress'] != 'Adress']
filename = 'total_venue_GooglePlace3.csv'
folder = 'GooglePlaceData'
# NOTE: the file is read from ./Scripts/GooglePlaceData but written to ./GooglePlaceData
destinationFileName = os.path.join(folder, filename)
total_venues3.to_csv(destinationFileName, index=False, encoding="utf-8")


In [27]:
total_venues = VenueLocationHelper.concatDataVenue()
filename = 'total_venues.csv'
folder = 'Venues'
destinationFileName = os.path.join(folder, filename)
total_venues.to_csv(destinationFileName, index=False, encoding="utf-8")
print('Total venues saved to file')
# Sanity check: list venues that appear more than once (an empty table is expected)
total_venues[total_venues.duplicated(['Venue'])]

Total venues saved to file


Unnamed: 0,Adress,City,Latitude,Longitude,Venue
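
`concatDataVenue` is defined in the helper script and is not shown here. A minimal sketch of such a concatenation, assuming each batch was saved as a CSV file that may contain repeated header lines, could look as follows. `load_part` and `concat_parts` are hypothetical names, and the in-memory files hold made-up rows.

```python
import io

import pandas as pd


def load_part(source) -> pd.DataFrame:
    """Read one batch and drop rows that are repeated header lines."""
    part = pd.read_csv(source)
    return part[part["Adress"] != "Adress"]


def concat_parts(sources) -> pd.DataFrame:
    """Stack all batches and keep a single row per venue."""
    merged = pd.concat([load_part(source) for source in sources], ignore_index=True)
    # If a venue was looked up twice, keep the row from the latest batch.
    return merged.drop_duplicates(subset="Venue", keep="last").reset_index(drop=True)


part_a = io.StringIO("Adress,City,Venue\nStreet 1,Town a,Venue a\n")
part_b = io.StringIO(
    "Adress,City,Venue\nAdress,City,Venue\nStreet 2,Town b,Venue b\nStreet 1,Town a,Venue a\n"
)
all_venues = concat_parts([part_a, part_b])
print(all_venues)
```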


## Updating total_events with the latitudes/longitudes acquired

Now, in the list of events, the latitude and longitude are provided for every event.

In [28]:
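# Drop the location columns from the events; they are added back from total_venues
total_venues.drop(['Adress'], axis = 1, inplace = True)
total_events.drop(['Adress','City','Latitude','Longitude'], axis = 1, inplace = True)
# Left join: every event is kept and receives the location columns of its venue
df_main = total_events.merge(total_venues, on='Venue', how='left')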
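
A left merge silently leaves NaN coordinates for events whose venue has no match. As a sketch of how this could be checked, the example below uses the `validate` and `indicator` options of `DataFrame.merge` on two small made-up frames; the column names mirror the ones used in this notebook.

```python
import pandas as pd

events = pd.DataFrame({
    "Artist": ["a", "b", "c"],
    "Venue": ["Venue a", "Venue a", "Venue b"],
})
venues = pd.DataFrame({
    "Venue": ["Venue a"],
    "City": ["Town a"],
    "Latitude": [46.5],
    "Longitude": [6.6],
})

# validate raises an error if a venue appears more than once on the right side;
# indicator adds a '_merge' column telling where each row came from.
merged = events.merge(venues, on="Venue", how="left",
                      validate="many_to_one", indicator=True)
missing = merged[merged["_merge"] == "left_only"]
print(len(merged), "events,", len(missing), "without coordinates")
```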

In [29]:
filename = 'total_eventsFinal.csv'
folder = 'FinalResults'
destinationFileName = os.path.join(folder, filename)
df_main.to_csv(destinationFileName, index=False, encoding="utf-8")
print('Total events data geo saved to file')

Total events data geo saved to file
