# **EDA Simplified: Foursquare Location Matching**

## Introduction
When I entered this competition about matching locations with AI, I remembered this:

> *It's a small world after all, It's a small world after all, It's a small world after all, It's a small, small world...*

And with that, I have a challenge for you. Can you identify the quote in this introduction within 15 seconds?

If you guessed wrong or ran out of time, better luck next time. But if you got it right away, then you're excellent! You've seen that it is from the song **It's a Small World**! Speaking of which, this small world contains plenty of locations you might visit, and I am a <span style="color: #7393B3">Tsurezure</span> traveller, so when Foursquare released a competition like this, it looked like a great opportunity to explore the world.

But what is the purpose of this competition? You'll match POIs (points of interest) using a simulated dataset from Foursquare of over one-and-a-half million Place entries, producing an algorithm that predicts which Place entries represent the same point of interest. Each Place entry includes attributes like the name, street address, and coordinates. Successful submissions will identify matches with the greatest accuracy. And if you get this right, you'll make it easier to identify where new stores or businesses would benefit people the most. *Bada-boom, Bada-bing.*

### Quick Heads-Up: About Foursquare
Before we proceed to our EDA, let's talk about the creator of this competition, Foursquare. Foursquare is the most trusted, independent location data platform for understanding how people move through the real world. With 12+ years of experience perfecting such methods, Foursquare is the first independent provider of global POI data. The leading independent location technology and data cloud platform, Foursquare is dedicated to building meaningful bridges between digital spaces and physical places. Trusted by leading enterprises like Apple, Microsoft, Samsung, and Uber, Foursquare’s tech stack harnesses the power of places and movement to improve customer experiences and drive better business outcomes. With that, let's move on to the EDA.

## Imports
The imports for this EDA notebook are a little special. First, we import geopandas, since this competition is all about locations. Then, we import pandas as pd for data handling and numpy as np for linear algebra. We also import the plotting modules: matplotlib's pyplot as plt, seaborn as sns, and plotly's graph_objects submodule as go and express submodule as px.

In [None]:
import geopandas
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px

## Dataframe Creation
After the imports, we create two dataframes, pairs_df and train_df, by passing the paths of the pairs.csv and train.csv files to the read_csv function of the pd module.

In [None]:
pairs_df = pd.read_csv("../input/foursquare-location-matching/pairs.csv")
train_df = pd.read_csv("../input/foursquare-location-matching/train.csv")

After that, we display the first 5 rows of our two newly created dataframes by calling the head function on them!

In [None]:
pairs_df.head()

In [None]:
train_df.head()

## First Peek in this Data
Now that we have our two dataframes, let's take a quick peek! First, we check for NaN values in pairs_df and train_df by chaining the isna function with the sum function (to count the NaN values in each column) on each dataframe.

In [None]:
pairs_df.isna().sum()

In [None]:
train_df.isna().sum()

As for the pairs_df dataframe, we see that these columns (NaN counts in parentheses):
* address_1 (103,524) 
* city_1 (65,979)
* state_1 (126,591)
* zip_1 (219,398)
* country_1 (8)
* url_1 (347,101)
* phone_1 (308,885)
* categories_1 (16,294)
* address_2 (266,410)
* city_2 (211,417)
* state_2 (269,218)
* zip_2 (354,080)
* country_2 (6)
* url_2 (494,057)
* phone_2 (459,944)
* categories_2 (75,976)

contain some NaN values.

On the other hand, these columns of the train_df dataframe (NaN counts in parentheses):
* name (1)
* address (396,621)
* city (299,189)
* state (420,586)
* zip (595,426)
* country (11)
* url (871,088)
* phone (795,957)
* categories (98,307)

also contain some NaN values. Now, let's move on to the number of observations in both dataframes!
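Raw NaN counts are hard to compare between dataframes of different sizes. As a minimal sketch, assuming a small made-up dataframe (toy_df and its values are hypothetical, not competition data), the same isna check can report the share of missing values per column by swapping sum for mean:

```python
import pandas as pd

# Toy stand-in for train_df; these values are invented for illustration only.
toy_df = pd.DataFrame({
    "name": ["Cafe A", "Cafe B", None, "Shop D"],
    "zip": [None, "10001", None, None],
    "country": ["US", "US", "TR", "ID"],
})

# isna() marks missing cells as True; mean() turns that into a fraction per column.
missing_share = toy_df.isna().mean().sort_values(ascending=False)
print(missing_share)
```

On the real data, the same call on train_df or pairs_df would likely show url, phone, and zip near the top, since they have the highest NaN counts in the lists above.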

To find the number of observations in our two dataframes, we read the shape attribute of pairs_df and train_df, index it with [0] to get the row count, and print the result.

In [None]:
print("No. of observations in pairs_df:", pairs_df.shape[0])

In [None]:
print("No. of observations in train_df:", train_df.shape[0])

As you can see, the pairs_df dataframe has 578,907 observations and the train_df dataframe has 1,138,812 observations. With that, let's start the EDA on our Foursquare competition data!

## EDA
Now for our EDA, let's analyze one dataframe at a time, one chapter each, so that we can explain everything clearly. So, let's move on and not dawdle!

### Chapter 1: pairs_df
To see the most frequent countries for the first place of each pair, we define a sub-dataframe called country_1: we take the country_1 column of pairs_df and chain the value_counts function (to count the rows per country, sorted from most to least frequent), the rename_axis function (to name the index country_1), and the reset_index function with the name parameter set to count (to turn the counts into a dataframe with the two columns country_1 and count).

Plotting the countries is simple with Seaborn. All we need to do is call the catplot function of the sns module, passing the first 20 rows of the country_1 dataframe (via the head function) as data, with the x parameter set to country_1, the y parameter set to count, the kind parameter set to bar, and the aspect parameter set to 2 for a wider figure.

In [None]:
# rename_axis + reset_index(name=...) yields the columns country_1 and count
# on both older and newer pandas (value_counts naming changed in pandas 2.0)
country_1 = pairs_df["country_1"].value_counts().rename_axis("country_1").reset_index(name="count")

sns.catplot(data=country_1.head(20), x='country_1', y='count', kind='bar', aspect=2)

Looking at this graph, we see that the United States of America has the most records in country_1, while Italy has the fewest among the 20 countries shown. Now, let's move on to the country_2 column, again with Seaborn!

Next, let's plot the country_2 column! We continue with Seaborn, following what we did for country_1: we define a country_2 variable with the counts, take its first 20 rows, and set the x parameter of the catplot function to country_2.

In [None]:
# Build the columns country_2 and count in a way that works across pandas versions
country_2 = pairs_df["country_2"].value_counts().rename_axis("country_2").reset_index(name="count")

sns.catplot(data=country_2.head(20), x='country_2', y='count', kind='bar', aspect=2)

Looking at the two plots, we see that the country_1 and country_2 columns are almost alike. Now, let's look at the categories!
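The same counting pattern comes back for every column in this notebook, so it could be wrapped in a small function. Below is a minimal sketch, assuming a hypothetical helper named top_counts and a made-up toy_pairs dataframe (neither is part of the original notebook):

```python
import pandas as pd

def top_counts(df, column, n=20):
    """Return the n most frequent values of `column` with a 'count' column.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    return (
        df[column]
        .value_counts()              # counts per value, most frequent first
        .rename_axis(column)         # name the index after the column
        .reset_index(name="count")   # two columns: <column> and count
        .head(n)
    )

# Toy stand-in for pairs_df with invented values.
toy_pairs = pd.DataFrame({"country_1": ["US", "US", "TR", "ID", "US", "TR"]})
top = top_counts(toy_pairs, "country_1", n=2)
print(top)
```

With a helper like this, each chart would only need one call such as top_counts(pairs_df, "country_2") before plotting.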

For the categories, we use Plotly. First, we follow what we did for counting the countries in the pairs_df dataframe, this time counting the categories_1 column and storing the result in the categories_1 variable. Then, we define a variable, fig, with the bar function of the px module, passing the first 20 rows of categories_1 (via the head function), with the x parameter set to categories_1 and the y parameter set to count. Finally, we display the figure by calling the show function on the fig variable.

In [None]:
# Build the columns categories_1 and count in a way that works across pandas versions
categories_1 = pairs_df["categories_1"].value_counts().rename_axis("categories_1").reset_index(name="count")

fig = px.bar(categories_1.head(20), x='categories_1', y='count')
fig.show()

Now let's do the same for categories_2!

In [None]:
# Build the columns categories_2 and count in a way that works across pandas versions
categories_2 = pairs_df["categories_2"].value_counts().rename_axis("categories_2").reset_index(name="count")

fig = px.bar(categories_2.head(20), x='categories_2', y='count')
fig.show()

As you can see, in the categories_1 graph, the most frequent category is Shopping Malls, with 11,606 entries, while the least frequent of the 20 shown is Parks, with 3,513 entries.

Meanwhile, in the categories_2 graph, the most frequent category is Residential Buildings (Apartments/Condos), with 11,604 entries, and the least frequent of the 20 shown is Pharmacies, with 3,837 entries.

And now, let's plot the lats and lons of our pairs_df with GeoPandas and Folium! For the first place of each pair, we convert the pairs_df dataframe into a geo dataframe (gdf): we define pairs_gdf with the GeoDataFrame function of the geopandas module, passing the pairs_df dataframe and setting the geometry parameter to the result of the points_from_xy function (which creates geometry points) called with the longitude_1 and latitude_1 columns of pairs_df.

In [None]:
pairs_gdf = geopandas.GeoDataFrame(pairs_df, geometry=geopandas.points_from_xy(pairs_df.longitude_1, pairs_df.latitude_1))

Let's display the first rows of our newly-created pairs_gdf geo dataframe by using the head function!

In [None]:
pairs_gdf.head()

Let's now plot the first set of coordinates over GeoPandas' built-in world map! We define another variable, world, with the read_file function of the geopandas module, which reads the file at the path returned by the get_path function of the datasets submodule for naturalearth_lowres. (Note that the datasets submodule was removed in GeoPandas 1.0, so this cell needs an older GeoPandas version.) Furthermore, we define another variable, ax, by calling the plot function on world to draw the map, setting the color parameter to white and the edgecolor parameter to black. We then call the plot function on the pairs_gdf geo dataframe to draw the coordinates, setting the ax parameter to ax and the color parameter to red. Finally, we show the figure with the show function of the plt module.

In [None]:
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
ax = world.plot(color="white", edgecolor="black")
pairs_gdf.plot(ax=ax, color='red')
plt.show()

When we run the plot above, we see that the points cluster together in North America, South America, Europe, and parts of Asia. We also see some POIs in Antarctica, where we can count four points. So, let's heatmap it with Folium!
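To double-check the points that seem to land in Antarctica, one option is to filter the rows by latitude and look at them directly. This is only a rough sketch on made-up data: toy_pairs and ANTARCTIC_LAT are hypothetical names, and treating everything south of 60°S as Antarctic is an assumption chosen for illustration.

```python
import pandas as pd

# Toy stand-in for pairs_df; names and coordinates are invented.
toy_pairs = pd.DataFrame({
    "name_1": ["Station X", "Mall Y", "Cafe Z"],
    "latitude_1": [-77.85, 40.71, -6.20],
    "longitude_1": [166.67, -74.00, 106.82],
})

# Assumption: use 60 degrees south as a rough cut-off for "Antarctic" points.
ANTARCTIC_LAT = -60.0
south = toy_pairs[toy_pairs["latitude_1"] < ANTARCTIC_LAT]
print(south[["name_1", "latitude_1", "longitude_1"]])
```

Reading the names and countries of such rows in the real data would help tell genuine polar POIs apart from entries with suspicious coordinates.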

Before we sizzle up the heatmaps, we need three imports: folium, plugins from folium itself, and HeatMap from folium.plugins.

After that, let's heat up the heatmaps! First, we convert the first latitudes and longitudes of our pairs_df dataframe to floats by reassigning each column to itself with the astype function, passing the float type.

Next, we define another variable, heat_data, as a list of [latitude, longitude] lists, built with a list comprehension that loops over the index and row of every row in the pairs_df dataframe using the iterrows function.

Then, we create another variable, basemap, with the Map function of the folium module to create an interactive map, setting the location parameter to a list of two numbers (the latitude and longitude of the initial map center) and the zoom_start parameter to 2 for the initial zoom level. We also call the HeatMap function on heat_data, setting the radius parameter to 10 and the blur parameter to 5, and attach it to the map with the add_to function, passing the basemap. Finally, we evaluate the basemap variable to display the map.

In [None]:
import folium
from folium import plugins
from folium.plugins import HeatMap

pairs_df["latitude_1"] = pairs_df["latitude_1"].astype(float)
pairs_df["longitude_1"] = pairs_df["longitude_1"].astype(float)

heat_data = [[row['latitude_1'], row['longitude_1']] for index, row in pairs_df.iterrows()]

In [None]:
basemap = folium.Map(location=[63, -38], zoom_start=2)
HeatMap(heat_data, radius=10, blur=5).add_to(basemap)
basemap

When you zoom out of the interactive "Leaflet" map output, you'll see that most of the heat covers North America, South America, Africa, Europe, and parts of Asia and Oceania. We also see some heat in Antarctica and in the Arctic. It's pretty much the same picture as the GeoPandas plot in our analysis.
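Looping over every row with iterrows can be slow on a dataframe with hundreds of thousands of rows. A minimal sketch of a vectorized alternative, assuming a made-up toy_pairs dataframe (toy_pairs and heat_data_fast are hypothetical names), selects the two coordinate columns and converts them to a list of lists in one step. The dropna call is an added safeguard for rows with missing coordinates, not something the original cell does.

```python
import pandas as pd

# Toy stand-in for pairs_df; the last row has a missing latitude on purpose.
toy_pairs = pd.DataFrame({
    "latitude_1": [40.71, -6.20, None],
    "longitude_1": [-74.00, 106.82, 28.97],
})

# Select both columns, drop incomplete rows, and convert to [[lat, lon], ...].
heat_data_fast = (
    toy_pairs[["latitude_1", "longitude_1"]]
    .dropna()
    .astype(float)
    .to_numpy()
    .tolist()
)
print(heat_data_fast)
```

The result has the same list-of-[lat, lon] shape as the list built with iterrows, so it could be passed to HeatMap in the same way.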

Now let's try plotting and heatmapping again with Folium and GeoPandas, but this time for the second set of lats and lons. It follows what we did for the first set, except that the variables and dataframes are built from the "latitude_2" and "longitude_2" columns of the pairs_df dataframe.

In [None]:
pairs_gdf_2 = geopandas.GeoDataFrame(pairs_df, geometry=geopandas.points_from_xy(pairs_df.longitude_2, pairs_df.latitude_2))
pairs_gdf_2

In [None]:
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
ax = world.plot(color="white", edgecolor="black")
pairs_gdf_2.plot(ax=ax, color='blue')
plt.show()

In [None]:
pairs_df["latitude_2"] = pairs_df["latitude_2"].astype(float)
pairs_df["longitude_2"] = pairs_df["longitude_2"].astype(float)

heat_data = [[row['latitude_2'], row['longitude_2']] for index, row in pairs_df.iterrows()]

In [None]:
basemap = folium.Map(location=[63, -38], zoom_start=2)
HeatMap(heat_data, radius=10, blur=5).add_to(basemap)
basemap

After running these four code cells, we can see only a slight difference: some of the first lats and lons are not the same as the second lats and lons. Speaking of which, we can find the number of True and False matches by plotting them, with Seaborn again.
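To put a number on how far apart the two coordinates of a pair are, one common choice is the haversine (great-circle) distance. This is a minimal sketch, not something the notebook computes: haversine_km is a hypothetical helper, and the mean Earth radius of 6371 km is an approximation.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Identical points are 0 km apart; a quarter turn along the equator is about 10,007 km.
same = haversine_km(41.0, 29.0, 41.0, 29.0)
quarter = haversine_km(0.0, 0.0, 0.0, 90.0)
print(same, round(quarter, 1))
```

Applied to latitude_1/longitude_1 and latitude_2/longitude_2, a distance like this could be compared between True and False pairs.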

To count the True and False matches with Seaborn, we call the catplot function of the sns module to make another categorical plot, setting the data parameter to the pairs_df dataframe, the x parameter to "match", and the kind parameter to "count".

In [None]:
sns.catplot(data=pairs_df, x="match", kind="count")

As the plot shows, almost 400K pairs are labeled True and about 180K pairs are labeled False. In other words, there are more True pairs than False pairs.
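The count plot gives the picture, and the same balance can be read off as numbers. Here is a minimal sketch on made-up data (toy_pairs is a hypothetical stand-in for pairs_df), relying on the fact that the mean of a boolean column is the share of True values:

```python
import pandas as pd

# Toy stand-in for the match column of pairs_df; values are invented.
toy_pairs = pd.DataFrame({"match": [True, True, False, True, False, True]})

# Share of each label, and the share of True matches as a single number.
print(toy_pairs["match"].value_counts(normalize=True))
true_share = toy_pairs["match"].mean()
print(true_share)
```

Knowing this share is useful later, because a model that always predicts True would already be right for that fraction of the pairs.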

That wraps up the chapter on the EDA of the pairs_df dataframe. The next chapter covers the train_df dataframe!

### Chapter 2: train_df
Now that we've analyzed the data in the pairs_df dataframe, let's go on to the train_df dataframe!

To see the most frequent countries in the train_df dataframe, we define a variable, country, the same way we did for the pairs_df dataframe, but using the country column of train_df.

Then, we call the catplot function of the sns module with six parameters: x set to country, y set to count, data set to the first 20 rows of the country dataframe (via the head function), kind set to bar, color set to "b" (short for blue), and aspect set to 2.

In [None]:
# Build the columns country and count in a way that works across pandas versions
country = train_df["country"].value_counts().rename_axis("country").reset_index(name="count")

sns.catplot(x="country", y="count", data=country.head(20), kind="bar", color="b", aspect=2)

Looking at the graph carefully, we notice a difference from the country_1 and country_2 graphs of pairs_df. The United States again has the most entries in the country column of train_df, but the fewest among the 20 countries shown belong to Saudi Arabia instead of Italy.

Now, let's find out the number of categories in this train_df dataframe we're in!

First, before plotting the categories with Plotly, we follow the same counting steps once more, this time for the categories column of the train_df dataframe.

We then define the variable fig again, but this time with the Figure function of the go module (the graph_objects submodule of plotly) to create a new figure, setting the data parameter to a list holding the result of the Bar function of the go module, which creates a bar chart with four parameters: x set to the first 20 entries of the categories column, y set to the first 20 entries of the count column, text set to the same values as y, and textposition set to auto.

Furthermore, we can customize the look of our bar chart by calling update_traces on the fig variable, setting marker_color and marker_line_color to any rgb-formatted colors, marker_line_width to 1.5, and opacity to 0.6. Finally, we display the bar chart by calling the show function on the fig variable.

In [None]:
# Build the columns categories and count in a way that works across pandas versions
categories = train_df["categories"].value_counts().rename_axis("categories").reset_index(name="count")

fig = go.Figure(data=[go.Bar(
            x=categories["categories"].head(20), y=categories["count"].head(20),
            text=categories["count"].head(20),
            textposition='auto',
        )])

fig.update_traces(marker_color='rgb(158,202,225)', marker_line_color='rgb(8,48,107)',
                  marker_line_width=1.5, opacity=0.6)

fig.show()

Unlike the categories_1 graph, but like the categories_2 graph from the pairs_df analysis, the most frequent category is Residential Buildings, while the least frequent of the 20 shown is Bars.

Without further ado, let's head on to geo-plotting, with Geopandas and Folium!

To geo-plot the data from the train_df dataframe with GeoPandas, we define a new geo dataframe called train_gdf with the GeoDataFrame function of the geopandas module, passing the train_df dataframe and setting the geometry parameter to the result of the points_from_xy function, which creates the points from the longitude and latitude columns of train_df.

In [None]:
train_gdf = geopandas.GeoDataFrame(train_df, geometry=geopandas.points_from_xy(train_df.longitude, train_df.latitude))

Now let's see the train_gdf GeoDataFrame!

In [None]:
train_gdf.head()

We now have the geometry data on the train_gdf geodataframe. With that, let's plot the points from it!

First, we define a variable, world, with the read_file function of the geopandas module, which reads the file at the path returned by the get_path function of the datasets submodule for the built-in naturalearth_lowres dataset.

Next, we define another variable, ax, by calling the plot function on world, setting the color parameter to white and the edgecolor parameter to black. We then call the plot function on the train_gdf geo dataframe, setting the ax parameter to ax and the color parameter to any color. Finally, we show our geo-plotted figure with the show function of the plt module.

In [None]:
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
ax = world.plot(color="white", edgecolor="black")
train_gdf.plot(ax=ax, color='purple')
plt.show()

As in our plots of the first and second lats and lons from the pairs_df data, we see that most points cluster together in North America, South America, Europe, and Southeast Asia. However, we also see a few extra points in the East Pacific and in Antarctica.

Without further delay, let's proceed to heatmapping! Before we start, we convert the latitude and longitude columns of the train_df dataframe to floats by reassigning each column to itself with the astype function, passing the float type. Then, we define another variable, heat_data, as a list of [latitude, longitude] lists, built with a list comprehension that loops over the index and row of every row in the train_df dataframe using the iterrows function.

Now let's create the heat in the maps! First, we define a variable, basemap, with the Map function of the folium module to create a new map, setting the location parameter to a list of two numbers (the initial map center) and the zoom_start parameter to a small number.

Next, we call the HeatMap function, passing heat_data and setting the radius parameter to 10 and the blur parameter to 5, and then add it to the basemap map. Finally, we evaluate the basemap variable to display the map.

In [None]:
train_df["latitude"] = train_df["latitude"].astype(float)
train_df["longitude"] = train_df["longitude"].astype(float)
heat_data = [[row["latitude"], row["longitude"]] for index, row in train_df.iterrows()]

In [None]:
basemap = folium.Map(location=[54, -39], zoom_start=1)
HeatMap(heat_data, radius=10, blur=5).add_to(basemap)
basemap

After analyzing the heatmap above, we can see that it is very similar to what we plotted with GeoPandas!
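With more than a million points, the heatmap above can get heavy to render in the browser. One possible workaround, sketched here on made-up data, is to draw a random subsample of the [lat, lon] list before handing it to HeatMap. The names toy_heat_data, heat_sample, and MAX_POINTS are hypothetical, and the cap of 1,000 points is an arbitrary value for illustration.

```python
import random

# Toy stand-in for the heat_data list of [lat, lon] pairs; values are invented.
toy_heat_data = [[float(i % 90), float(i % 180)] for i in range(10_000)]

MAX_POINTS = 1_000        # hypothetical cap; tune it to what the browser can handle
rng = random.Random(42)   # fixed seed so the subsample is reproducible

# Sample without replacement only when the list is larger than the cap.
if len(toy_heat_data) > MAX_POINTS:
    heat_sample = rng.sample(toy_heat_data, MAX_POINTS)
else:
    heat_sample = toy_heat_data
print(len(heat_sample))
```

A uniform random subsample keeps the overall density pattern roughly intact, although very sparse regions may lose their few points.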

And with that, we're done with chapter 2 on the train_df dataframe!

## Conclusion
We did it! We ran an EDA on the Foursquare competition data and travelled around the world along the way! That leaves us with one more question: "Is our world a small world after all?" We may never know, since our world holds such a vast number of POIs.