![](https://images.ctfassets.net/exg8oyvb0wfw/6YGSNm9Fz3m8jSkV5c3fIz/7274cc9f7b03f83df6da513c454773f2/FSQ_social_share_image.gif?q=75)

# Training Data


* train.csv - The training set, comprising eleven attribute fields for over one million place entries, together with:
    * id - A unique identifier for each entry.
    * point_of_interest - An identifier for the POI the entry represents. There may be one or many entries describing the same POI. Two entries "match" when they describe a common POI.
* pairs.csv - A pregenerated set of pairs of place entries from train.csv designed to improve detection of matches. You may wish to generate additional pairs to improve your model's ability to discriminate POIs.
    * match - Whether (True or False) the pair of entries describes a common POI.
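
As a rough sketch of how additional labelled pairs could be generated from train.csv, the snippet below builds every unordered pair of entries and marks a pair as a match when both entries share the same `point_of_interest`. The toy rows, the helper `make_pairs`, and the output column names `id_1`/`id_2` are assumptions made for illustration, not the format of the pregenerated pairs.csv.

```python
import pandas as pd

# Toy stand-in for train.csv (ids and POI labels are made up for illustration).
toy_train = pd.DataFrame({
    "id": ["E_1", "E_2", "E_3", "E_4"],
    "point_of_interest": ["P_a", "P_a", "P_b", "P_a"],
})

def make_pairs(df):
    """Return every unordered pair of entries with a boolean `match` label."""
    left = df.add_suffix("_1")
    right = df.add_suffix("_2")
    # Cross join: every entry paired with every entry (requires pandas >= 1.2).
    pairs = left.merge(right, how="cross")
    # Keep each unordered pair once and drop self-pairs.
    pairs = pairs[pairs["id_1"] < pairs["id_2"]].reset_index(drop=True)
    # Two entries match when they describe the same POI.
    pairs["match"] = pairs["point_of_interest_1"] == pairs["point_of_interest_2"]
    return pairs[["id_1", "id_2", "match"]]

toy_pairs = make_pairs(toy_train)
print(toy_pairs)
```

A full cross join is only workable on small samples. On the real training set, candidate pairs would have to be restricted first, for example to entries that are geographically close.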

# Example Test Data

To help you author submission code, we include a few example instances selected from the test set. When you submit your notebook for scoring, this example data will be replaced by the actual test data. The actual test set has approximately 600,000 place entries with POIs that are distinct from the POIs in the training set.

* test.csv - A set of place entries with their recorded attribute fields, similar to the training set.
* sample_submission.csv - A sample submission file in the correct format.
    * id - The unique identifier for a place entry, one for each entry in the test set.
    * matches - A space delimited list of IDs for entries in the test set matching the given ID. Place entries always self-match.
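
A minimal sketch of how the `matches` column could be assembled is shown below. The toy ids, the `predicted` dictionary (standing in for the output of some model), and the helper `format_matches` are all hypothetical. The only rules taken from the description above are the space-delimited format and the fact that every entry matches itself.

```python
import pandas as pd

# Toy stand-in for test.csv (ids are made up for illustration).
toy_test = pd.DataFrame({"id": ["E_1", "E_2", "E_3"]})

# Hypothetical model output: other entries predicted to match each id.
predicted = {"E_1": ["E_2"], "E_2": ["E_1"], "E_3": []}

def format_matches(entry_id, matched_ids):
    """Build the space-delimited match string, always including the entry itself."""
    ordered = [entry_id] + [m for m in matched_ids if m != entry_id]
    # dict.fromkeys removes duplicates while preserving order.
    return " ".join(dict.fromkeys(ordered))

submission = pd.DataFrame({
    "id": toy_test["id"],
    "matches": [format_matches(i, predicted.get(i, [])) for i in toy_test["id"]],
})
print(submission)
```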

In [None]:
pip install country-converter

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import seaborn as sns
import matplotlib.pyplot as plt
import country_converter as coco

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
train = pd.read_csv('../input/foursquare-location-matching/train.csv')
test = pd.read_csv('../input/foursquare-location-matching/test.csv')
sample = pd.read_csv('../input/foursquare-location-matching/sample_submission.csv')
pairs = pd.read_csv('../input/foursquare-location-matching/pairs.csv')

# 1. Check data shape

In [None]:
# Size of each csv
train.shape, test.shape, sample.shape, pairs.shape

In [None]:
train.head(3)

# 2. Check data type and missing values

In [None]:
train.info()

In [None]:
missing_df=train.isnull().sum() / train.shape[0] * 100
missing_df=missing_df.sort_values(ascending=False)
missing_df

# 3. Location data : Unique Data Point analysis ![](https://600commerce.com/wp-content/uploads/2020/09/findingpic.png)

In [None]:
round(len(train['point_of_interest'].unique())/len(train['point_of_interest']),2)

**The number of unique POIs is about 65% of the number of entries.**

In [None]:
# Number of entries per POI, as a DataFrame with columns 'point_of_interest' and 'size'
POI_df=train.groupby('point_of_interest',as_index=False).size()

In [None]:
POI_df.sort_values('size',ascending=False).head(10)

**Some points of interest (POIs) have multiple entries in the dataset.**

In [None]:
round(POI_df[POI_df['size']>2]['size'].sum()/POI_df['size'].sum(),2)

**About 14% of the entries belong to a POI that has more than 2 entries in the dataset.**
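
The ratio computed above is a share of entries, not a share of POIs, and the two can differ a lot. The sketch below shows both quantities side by side on a small made-up table. The toy data and the variable names are assumptions for illustration only.

```python
import pandas as pd

# Toy data: P_a has 3 entries, P_b has 2, P_c has 1, P_d has 4.
toy = pd.DataFrame({
    "point_of_interest": ["P_a"] * 3 + ["P_b"] * 2 + ["P_c"] + ["P_d"] * 4
})

sizes = toy["point_of_interest"].value_counts()  # entries per POI
large = sizes[sizes > 2]                         # POIs with more than 2 entries

share_of_pois = len(large) / len(sizes)          # fraction of POIs that are "large"
share_of_entries = large.sum() / sizes.sum()     # fraction of entries in "large" POIs

print(share_of_pois, share_of_entries)
```

The same two lines applied to `train['point_of_interest'].value_counts()` would give both views for the real data.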

In [None]:
train[train['point_of_interest']=='P_fb339198a31db3'].head(10)

**In the example above, the attributes of the same POI show low consistency across entries, which makes the competition both interesting and challenging.**

# 4. Location data : Missing features

In [None]:
plt.figure(figsize=(8,7))
color=["gray"]*len(missing_df.index)
color[0]="aqua"
sns.barplot(x=missing_df.index, y=missing_df.values, palette=color, saturation=.5)
plt.xticks(rotation=90)
plt.title("% Missing Features")
plt.xlabel('Feature')
plt.ylabel('Percentage')

plt.tight_layout()

# 5. Location data : Availability by country

In [None]:
country_stats=train['country'].value_counts()*100/train['country'].value_counts().sum()
country_stats=country_stats.head(10)

plt.figure(figsize=(8,7))
color=["gray"]*len(country_stats.index)
color[0]="aqua"
sns.barplot(x=country_stats.index, y=country_stats.values, palette=color, saturation=.5)
plt.title("% Data by Country")
plt.xlabel('country')
_=plt.ylabel('Percentage')

**State and City fields have over 25% data missing. Hence I didn't plot them.**

In [None]:
# Share of entries per country, in percent
country_stats=train['country'].value_counts(normalize=True)*100

# Convert the country codes to ISO3, the default location format of plotly's choropleth
country_stats.index = coco.convert(names=country_stats.index.tolist(), to='ISO3')
# Name both columns explicitly so the result does not depend on the pandas version
country_stats=country_stats.rename_axis('country').reset_index(name='% data')
country_stats.head()

In [None]:
import plotly.graph_objects as go

fig = go.Figure(data=go.Choropleth(
    locations = country_stats['country'],
    z = country_stats['% data'],
    text = country_stats['country'],
    colorscale = 'greens',
    autocolorscale=False,
    marker_line_color='lightgray',
    marker_line_width=0.5,
    colorbar_ticksuffix = '%',
    colorbar_title = 'Data availability',
))
fig.update_layout(title_text='% Data by Country')
fig.show()

# 6. Location data : Availability by States in the US

In [None]:
# Share of US entries per state, in percent
state_stats=train.loc[train['country']=='US','state'].value_counts(normalize=True)*100
state_stats=state_stats.head(10)

plt.figure(figsize=(8,7))
color=["gray"]*len(state_stats.index)
color[0]="aqua"
sns.barplot(x=state_stats.index, y=state_stats.values, palette=color, saturation=.5)
plt.title("% Data by State")
plt.xlabel('State')
_=plt.ylabel('Percentage')

In [None]:
state_stats=train.loc[train['country']=='US','state'].value_counts(normalize=True)*100
# Name both columns explicitly so the result does not depend on the pandas version
state_stats=state_stats.rename_axis('state').reset_index(name='% data')
state_stats.head()

In [None]:
fig = go.Figure(data=go.Choropleth(
    locations=state_stats['state'], # Spatial coordinates
    z = state_stats['% data'].astype(float), # Data to be color-coded
    locationmode = 'USA-states', # set of locations match entries in `locations`
    colorscale = 'Reds',
    colorbar_title = 'Data availability (% in the US)',
))

fig.update_layout(
    title_text = '% Data in the US-States',
    geo_scope='usa', # limit map scope to USA
)
fig.show()

# 7. Location data : Category 

In [None]:
category_stats=train['categories'].value_counts()*100/train['categories'].value_counts().sum()
category_stats=category_stats.head(10)

plt.figure(figsize=(8,7))
color=["gray"]*len(category_stats.index)
color[0]="aqua"
sns.barplot(x=category_stats.index, y=category_stats.values, palette=color, saturation=.5)
plt.xticks(rotation=90)
plt.title("% Data by category")
plt.xlabel('category')
_=plt.ylabel('Percentage')

In [None]:
train['categories'].value_counts().head(30)

**There is a high probability that location entries are categorised inconsistently.**

**For example, one data point could be labelled 'Fast Food Restaurants' while another could be labelled Hotels/Restaurants/Chinese Restaurants etc.
It is worth looking for methods to derive a common category tag between pairs of similar data points.**
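
One possible way to quantify category agreement between two entries is the Jaccard overlap of their category labels. The sketch below assumes that `categories` holds a comma-separated string of labels and that missing values should count as "no categories". The helper names `category_set` and `category_jaccard` are made up for this illustration.

```python
def category_set(categories):
    """Split a comma-separated category string into a set of lower-cased labels."""
    if not isinstance(categories, str):
        return set()  # treat missing values (NaN/None) as having no categories
    return {c.strip().lower() for c in categories.split(",") if c.strip()}

def category_jaccard(a, b):
    """Jaccard similarity between the category sets of two entries (0.0 to 1.0)."""
    set_a, set_b = category_set(a), category_set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # neither entry has a category
    return len(set_a & set_b) / len(union)

# One shared label out of two distinct labels overall
print(category_jaccard("Fast Food Restaurants", "Hotels, Fast Food Restaurants"))
```

A score like this could serve as one feature of a pair, next to name and distance features. Exact label matching will still miss related labels such as 'Restaurants' and 'Chinese Restaurants', so a coarser mapping to parent categories may be needed.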

![](https://i.pinimg.com/originals/07/17/c9/0717c983090589d4b8f58436a5c720b1.png) **Hope you found this notebook useful!!** 

**I'm also hoping to publish an ML notebook with a baseline model soon!**