# COS20083 Advanced Data Analytics

## Assignment 2: Case Study and Algorithm Implementation

### Semester 1, 2022

#### Group Number: <p style ="color: red;"> 11</p>
#### Group Members: <p style ="color: red;"> Lim Zong Xin (101232574), Justin Liu Shan Wei (101231403)</p>

# 1. Introduction

### What is the purpose of the assignment?
### What is the problem to be addressed by this case study?

The purpose of the assignment is to build a machine learning model that predicts which Place entries represent the same point-of-interest (POI). The problem addressed in this case study is matching POIs in a simulated dataset from Foursquare, which contains over one-and-a-half million Place entries. Matching these entries supports the wider goal of predicting where new stores and businesses will benefit people the most.

# 2. Data Collection

### Describe the purpose and the process of data collection and understanding here

The purpose of data collection and understanding is to gather information systematically so that it can be analysed. The CSV files used in this assignment are train.csv, test.csv, sample_submission.csv and pairs.csv. Several Python libraries were imported, and pandas was used to read the CSV files with the pd.read_csv() function. The data types present in each dataframe are then shown using the df.info() method, as listed below.

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
#Import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import BallTree
from tqdm import tqdm

In [None]:
#Read csv files
df_train = pd.read_csv('../input/foursquare-location-matching/train.csv')
df_test = pd.read_csv('../input/foursquare-location-matching/test.csv')

sample_submission = pd.read_csv('../input/foursquare-location-matching/sample_submission.csv')
pairs=pd.read_csv('../input/foursquare-location-matching/pairs.csv')

In [None]:
#Display data types in train dataset
df_train.info()

In [None]:
#Display data types in test dataset
df_test.info()

In [None]:
#Display data types in sample_submission dataset
sample_submission.info()

In [None]:
#Display data types in pairs dataset
pairs.info()

# 3. Exploratory Data Analysis

Exploratory data analysis is an approach to analysing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods.

In [None]:
#Total number of rows and columns in datasets
print(df_train.shape)
print(df_test.shape)
print(sample_submission.shape)
print(pairs.shape)

In [None]:
#Display first few rows of Train dataset
df_train.head()

In [None]:
#Display first few rows of Test dataset
df_test.head()

In [None]:
#Total number of missing values in Train dataset
print(df_train.isnull().sum())

In [None]:
#Missing values in pairs dataset
print(pairs.isnull().sum())

In [None]:
# Share of missing values per variable and count of missing values per observation
fig, axs = plt.subplots(1, 2, figsize=(15, 5))

# Left panel: percentage of missing values for each variable
(df_train.isna().mean() * 100).sort_values(ascending=False).plot(
    kind="bar", title="Missing Values by Variable", ax=axs[0]
)
axs[0].set_ylabel("% of Missing Values")

# Right panel: how many observations have 0, 1, 2, ... missing variables
df_train.isna().sum(axis=1).value_counts().sort_index().plot(
    ax=axs[1], title="Missing Values by Observation", kind="bar"
)
axs[1].set_xlabel("Number of Missing Variables")
axs[1].set_ylabel("Number of Observations")
plt.xticks(rotation=0)
plt.show()

In [None]:
#Display number of unique values for each variable for train dataset
df_train.nunique()

In [None]:
#Display first few rows of Pairs dataframe
pairs.head()

In [None]:
#Statistical summary for Pairs dataset
pairs.describe()

In [None]:
#Plot percentage of data by country in Train dataset (top 10 countries)
country_stats = df_train['country'].value_counts(normalize=True) * 100
country_stats = country_stats.head(10)

plt.figure(figsize=(8,7))
#Highlight the country with the most entries
color = ["gray"] * len(country_stats.index)
color[0] = "aqua"
sns.barplot(x=country_stats.index, y=country_stats.values, palette=color, saturation=.5)
plt.title("% Data by Country")
plt.xlabel('Country')
_ = plt.ylabel('Percentage')

From the graph, it can be seen that the US has the most data entries.

In [None]:
#Plot percentage of data by state in the US (top 10 states)
us_states = df_train.loc[df_train['country'] == 'US', 'state']
state_stats = us_states.value_counts(normalize=True) * 100
state_stats = state_stats.head(10)

plt.figure(figsize=(8,7))
#Highlight the state with the most entries
color = ["gray"] * len(state_stats.index)
color[0] = "aqua"
sns.barplot(x=state_stats.index, y=state_stats.values, palette=color, saturation=.5)
plt.title("% Data by State")
plt.xlabel('State')
_ = plt.ylabel('Percentage')

In [None]:
#Plot the most frequent categories in the Train dataset
print(f'There are {df_train["categories"].nunique()} unique categories')

# Keep only the categories that appear more than 5,000 times.
# Boolean filtering avoids relying on the column name produced by to_frame().
category_counts = df_train["categories"].value_counts()
category_counts[category_counts > 5_000].sort_values(ascending=True).plot(
    kind="barh", title="Most Frequent Categories", figsize=(5, 8)
)
plt.show()

In [None]:
#Drop unwanted variables from pairs dataset
pairs = pairs.drop(['address_1','city_1','state_1','zip_1','url_1','phone_1','address_2','city_2','state_2','zip_2','url_2','phone_2'],axis=1)
pairs = pairs.fillna("__nan__")
pairs.head()

In [None]:
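#Encode the string columns as integer codes (label encoding via factorize) so they can be fitted into the model
#Note: each column is factorized independently, so codes are not shared between the _1 and _2 columns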
pairs.country_1 = pairs.country_1.factorize()[0]
pairs.country_2 = pairs.country_2.factorize()[0]
pairs.categories_1 = pairs.categories_1.factorize()[0]
pairs.categories_2 = pairs.categories_2.factorize()[0]
pairs.name_1 = pairs.name_1.factorize()[0]
pairs.name_2 = pairs.name_2.factorize()[0]
pairs.match = pairs.match.factorize()[0]

pairs.head()


In [None]:
#Filling missing values in test and train dataset
df_test.categories = df_test.categories.fillna('__NAN__')
df_test.name = df_test.name.fillna('__NAN__')
df_train['country'] = df_train['country'].fillna('NA')
df_train.categories=df_train.categories.fillna('__NAN__')
df_train.name = df_train.name.fillna('__NAN__')

### Explain:
1. Description of dataframe
2. Graphical plots of data
3. Descriptive statistics of data

- After reading all the CSV files, the number of rows and columns in each dataframe is shown using the df.shape attribute. df_train has 1138812 rows and 13 columns, df_test has 5 rows and 12 columns, sample_submission has 5 rows and 2 columns, and pairs has 578907 rows and 25 columns. The number of missing values per column is shown by df_train.isnull().sum() for df_train and by pairs.isnull().sum() for pairs.

- Bar charts are plotted to show which countries and which US states have the most data entries. In the train dataset, the US has the most data entries, and within the US, California (CA) has the most. A chart of the most frequent categories shows that Residential Buildings (Apartments / Condos) is the category that appears most often in the train dataset.

# 4. Model Building

### Describe the process of model building

In [None]:
#Select features from pairs dataset to be used 
features = ['latitude_1', 'latitude_2', 'longitude_1', 'longitude_2','country_1','country_2','categories_1','categories_2','name_1','name_2']

#Assign X and y
X = pairs[features]
y = pairs.match

#Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#Initialize Random Forest Classifier Model
model = RandomForestClassifier(n_jobs = -1)

#Fit X and y Train into model
model.fit(X_train, y_train)

In [None]:
# Reference from https://www.kaggle.com/code/andypenrose/spatial-neighbours-benchmark-name-and-category

#Takes the latitude and longitude values (converted to radians) to construct the ball tree
tree = BallTree(np.deg2rad(df_test[['latitude', 'longitude']].values), metric='haversine')

In [None]:
# list for storing the points of interest
pois_out = []
# number of neighbours considered
n = min(20, len(df_test))
# max number of recommended points of interest
max_poi = 2
# max distances in radians (the haversine metric returns angular distances)
max_dist_cat = 0.0005
max_dist_name = 0.005
max_dist = max(max_dist_cat, max_dist_name)

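for i, row in tqdm(df_test.iterrows(), total=len(df_test)):
    # Query the n nearest neighbours of the current place (coordinates in radians)
    distances, indices = tree.query(np.deg2rad(np.c_[row['latitude'], row['longitude']]), k = n)
    poi = []
    for d, j in zip(distances[0], indices[0]):
        # Close neighbour whose category string contains, or is contained in, this row's category
        if d <= max_dist_cat and row['categories'] != '__NAN__' and (row['categories'] in df_test.categories.iloc[j] or df_test.categories.iloc[j] in row['categories']):
            poi.append(df_test.id.iloc[j])
        # Neighbour within the wider name radius that has the same name (case-insensitive)
        elif d <= max_dist_name and row['name'] != '__NAN__' and (row['name'].lower() == df_test.name.iloc[j].lower()):
            poi.append(df_test.id.iloc[j])
        # Neighbours are sorted by distance, so stop once they are too far away or enough matches are found
        if d > max_dist or len(poi) >= max_poi:
            break

    # Fall back to the place's own id when no match was found
    if len(poi) == 0:
        pois_out.append(row['id'])
    else:
        pois_out.append(' '.join(poi))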

In [None]:
#Show matches
sample_submission.matches = pois_out
sample_submission.head()

In [None]:
#Copy output to csv file.
sample_submission.to_csv('submission.csv', index=False)

## 1. Partitioning of data
To build the Random Forest model, the pairs.csv file is split into a training set and a test set using the train_test_split function from scikit-learn, with 20% of the rows held out for testing. For the BallTree model, the longitude and latitude in the test dataset are used to construct the ball tree. A query is then executed with the test dataset, which returns two arrays containing the distances and indices of the neighbouring locations. The indices are then used to look up the candidate matching locations.
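
To make the shape of the query result concrete, the following is a minimal sketch on three made-up coordinates (they are illustrative assumptions, not rows from the dataset). It shows that `query` returns one row of distances and one row of indices per query point, sorted from nearest to farthest.

```python
import numpy as np
from sklearn.neighbors import BallTree

# Hypothetical (latitude, longitude) pairs in degrees, for illustration only
coords_deg = np.array([
    [1.5535, 110.3593],   # point 0
    [1.5536, 110.3594],   # point 1, a few metres from point 0
    [3.1390, 101.6869],   # point 2, hundreds of kilometres away
])

# The haversine metric expects [latitude, longitude] in radians
tree = BallTree(np.deg2rad(coords_deg), metric="haversine")

# Ask for the 2 nearest neighbours of every point
# (each point is its own nearest neighbour at distance 0)
distances, indices = tree.query(np.deg2rad(coords_deg), k=2)

# Distances are angles in radians; multiply by an approximate Earth radius for kilometres
EARTH_RADIUS_KM = 6371.0
distances_km = distances * EARTH_RADIUS_KM

print(indices)
print(distances_km.round(3))
```

Here `indices[0]` lists point 0 itself followed by point 1, which is how the notebook's loop finds candidate matches for each row.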

## 2. Model selection
The main model selected to solve the problem is the Random Forest Classifier, which can maintain accuracy across a large proportion of the data. The team also tried a BallTree from the sklearn.neighbors library. A ball tree organises points in a multi-dimensional space by partitioning them into nested balls, each defined by a centre and a radius, which makes nearest-neighbour queries efficient. This is useful for our problem because, with the haversine metric, it gives the approximate great-circle distance between coordinates, which can be used to find nearby candidate matches.
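
As a rough illustration of the distance being approximated, the sketch below implements the haversine formula directly and converts the notebook's two radian thresholds (0.0005 and 0.005) into kilometres. The function name `haversine_km` and the mean Earth radius of 6371 km are assumptions made for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# The thresholds in the notebook are angular distances in radians;
# multiplying by the Earth radius expresses them in kilometres
max_dist_cat_km = 0.0005 * EARTH_RADIUS_KM
max_dist_name_km = 0.005 * EARTH_RADIUS_KM

print(f"category radius: about {max_dist_cat_km:.2f} km")
print(f"name radius: about {max_dist_name_km:.2f} km")
print(f"one degree of latitude: about {haversine_km(0, 0, 1, 0):.2f} km")
```

Under these assumptions the category threshold corresponds to roughly 3 km and the name threshold to roughly 32 km.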

## 3. Model Training
For the Random Forest Classifier model, we selected several features for training after converting the attributes that contained string values to numerical codes so that they could be fitted into the model. The BallTree is not trained in the supervised sense; it is built from the latitude and longitude given in the test dataset. The coordinates are converted from degrees to radians using the deg2rad function because the haversine metric in BallTree expects radians.
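
The notebook factorizes `country_1` and `country_2` separately, so the same country string may receive different integer codes in the two columns. The following is a sketch of one possible alternative, not the method used above: both columns are factorized together so that they share one vocabulary. The toy values and the `same_country` feature are assumptions for illustration.

```python
import pandas as pd

# Toy pairs table; the values are made up for illustration
toy_pairs = pd.DataFrame({
    "country_1": ["US", "MY", "US"],
    "country_2": ["MY", "MY", "US"],
})

# Factorize both columns together so one string maps to one code in both columns
combined = pd.concat([toy_pairs["country_1"], toy_pairs["country_2"]], ignore_index=True)
codes, uniques = pd.factorize(combined)

n = len(toy_pairs)
toy_pairs["country_1_code"] = codes[:n]
toy_pairs["country_2_code"] = codes[n:]

# A derived feature that is only meaningful when the codes are shared
toy_pairs["same_country"] = (
    toy_pairs["country_1_code"] == toy_pairs["country_2_code"]
).astype(int)

print(toy_pairs)
```

With a shared vocabulary, a tree-based model can compare the two columns directly, and an explicit equality feature such as `same_country` becomes possible.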

## 4. Attributes that have the greatest effect on the matching result
The attributes we used to train the Random Forest model, and which we consider to have the greatest effect on the matching results, are Country, Latitude, Longitude, Name and Category. For the BallTree model, we used the longitude and latitude from the test dataset.
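
One way to check which attributes matter most to a fitted Random Forest is its `feature_importances_` attribute. The sketch below uses synthetic data with hypothetical column names (`informative`, `noise`), so the numbers say nothing about the pairs dataset; it only shows the mechanics.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the pairs features; column names are hypothetical
X_toy = pd.DataFrame({
    "informative": rng.normal(size=500),
    "noise": rng.normal(size=500),
})
# The label depends only on the "informative" column
y_toy = (X_toy["informative"] > 0).astype(int)

toy_model = RandomForestClassifier(n_estimators=50, random_state=0)
toy_model.fit(X_toy, y_toy)

# Impurity-based importances sum to 1 across the features
importances = pd.Series(
    toy_model.feature_importances_, index=X_toy.columns
).sort_values(ascending=False)
print(importances)
```

For the model trained in this notebook, the equivalent call would be `pd.Series(model.feature_importances_, index=features)`.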

<!-- ### Explain: 
1. how the data is partitioned
2. how the model is chosen
3. how the model is trained
4. the attributes that have the greatest effect on the matching results


1. The longitude and latitude in the train dataset are used to construct the ball tree. A query is then executed with the test dataset, which returns two arrays containing the distances and indices of the neighbouring locations. The indices are then used to match the correct locations.

2. The model is chosen because it is able to deal with large datasets. In our case, a large test set will be fitted into our model to be tested. A BallTree from the sklearn.neighbors library organises the points in a multi-dimensional space and is assigned to the tree variable. It divides points based on radial distances to a centre. This is useful for our problem as it can approximately determine the actual distance between coordinates, which can be used to find matching locations.

3. The model is built using the latitude and longitude given in the test dataset to construct the ball tree. The coordinates are converted from degrees to radians using the deg2rad function because the haversine distance is used in the BallTree function.

4. The attributes that have the greatest effect on the matching results are the longitude and latitude. -->

# 5. Model Evaluation

### Describe the process of model evaluation

In [None]:
# import library
from sklearn.metrics import jaccard_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [None]:
# use the X_test feature to predict the value of y
y_pred = model.predict(X_test)

In [None]:
# Since the submissions are evaluated by the mean Intersection over Union, which is Jaccard score
# Also include the jaccard_score in the model evaluation

iou_score = jaccard_score(y_test, y_pred)
iou_score

In [None]:
# Display the classification report to show the accuracy of the model
print(classification_report(y_test, y_pred))

In [None]:
# Display the confusion matrix to show how many observations were correctly predicted

cm = pd.DataFrame(confusion_matrix(y_test, y_pred))
cm

### Explain: 
1. the performance of the model created
* The model used above is a Random Forest model. The IoU score shows that the overlap between y_pred and y_test is not high. The classification report shows that the model achieves an accuracy of around 79%, which is reasonably good. The confusion matrix shows that, although many true positives and true negatives are predicted correctly, a large number of observations are still wrongly labelled.


2. how the model can be used to predict or match the POIs accurately
* Our model learns the relationship between the selected features and the match column. It does not directly predict or match a POI; instead, it classifies a pair of entries, and if the predicted match output is true, the two entries are taken to represent the same POI.
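
The gap between accuracy and IoU can be seen from how each is computed from the confusion matrix: accuracy counts true negatives as successes, while the binary Jaccard score ignores them. The following is a small worked example on made-up labels, not the notebook's results.

```python
from sklearn.metrics import confusion_matrix, jaccard_score

# Made-up labels for illustration (1 = match, 0 = no match)
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_hat = [1, 0, 1, 0, 1, 1, 0, 0]

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

# Accuracy rewards true negatives; IoU (Jaccard) does not
accuracy = (tp + tn) / (tp + tn + fp + fn)
iou = tp / (tp + fp + fn)

print(f"accuracy = {accuracy:.3f}")
print(f"IoU      = {iou:.3f}")
print(f"jaccard_score = {jaccard_score(y_true, y_hat):.3f}")
```

In this toy case the accuracy is 5/8 while the IoU is 3/6, so a model can look acceptable on accuracy and still score lower on IoU.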

# 6. Model Validation (Challenge)

### Describe the process of model validation

In [None]:
from sklearn.model_selection import KFold, cross_val_score

In [None]:
#Perform K-Fold cross validation
kf=KFold(n_splits=10)
score=cross_val_score(model,X_test,y_test,cv=kf)
print("Cross Validation Scores are {}".format(score))
print("Average Cross Validation score :{}".format(score.mean()))

### Explain: 
### 1. The Cross-Validation Approach
* The cross-validation approach we applied is K-fold with the number of folds set to 10, using the Random Forest model. The dataset used is the test set obtained after splitting pairs.csv.

### 2. The matching or predictive performance of the model created
* The average cross-validation score of the Random Forest model is around 0.76. The model performance is decent and could be improved with more useful features and fewer missing values.
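
The notebook uses `KFold` without shuffling. A common variant is to shuffle the rows and keep the class balance in every fold with `StratifiedKFold`; the sketch below shows it on synthetic data from `make_classification`, so the scores are not comparable to the figures reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data, standing in for the pairs features
X_toy, y_toy = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)

# Shuffle before splitting and preserve the class ratio in each fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
toy_model = RandomForestClassifier(n_estimators=30, random_state=0)

# cross_val_score clones and refits the model on each training fold
scores = cross_val_score(toy_model, X_toy, y_toy, cv=cv, scoring="accuracy")
print(f"mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```

Reporting the standard deviation alongside the mean gives a sense of how stable the score is across folds.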

# 7. Discussion

### Identify:
### 1. The factors that have significant influences on location matching
- The factors that influence location matching significantly are the latitude and longitude attributes.

### 2. Any interesting observation from this challenge
- From this challenge, we learned that commercial points-of-interest (POI) are important information for businesses. Knowing the POI of each shop shows which categories of shops are located where, and therefore where a new shop might best be set up. We use longitude and latitude to find candidate entries and check whether they refer to the same POI, so that a business owner can identify a suitable location to set up their shop.

### Explain:
### 1. The limitations and weaknesses of the modelling approach
- Random Forest Classification model: The pairs dataset contains many null values, which can make the predictions inaccurate.
- BallTree model: An accuracy score cannot be obtained with the validation and evaluation methods we used.
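
One possible way to score the BallTree output offline is to compute the mean IoU between predicted and true sets of matching ids, row by row. The sketch below assumes that both are available as space-delimited id strings, as in the submission format; the helper names `row_iou` and `mean_iou` and the ids are hypothetical, and the true match lists would have to be derived from labelled data such as the training set.

```python
def row_iou(true_matches: str, predicted_matches: str) -> float:
    """Intersection over Union between two space-delimited lists of ids."""
    true_set = set(true_matches.split())
    pred_set = set(predicted_matches.split())
    union = true_set | pred_set
    if not union:
        return 0.0
    return len(true_set & pred_set) / len(union)


def mean_iou(true_rows, predicted_rows) -> float:
    """Average the row-wise IoU over all rows."""
    scores = [row_iou(t, p) for t, p in zip(true_rows, predicted_rows)]
    return sum(scores) / len(scores)


# Hypothetical ids for illustration only
truth = ["E_1 E_2", "E_3", "E_4 E_5 E_6"]
preds = ["E_1 E_2", "E_3 E_9", "E_4"]

score = mean_iou(truth, preds)
print(f"mean IoU = {score:.4f}")
```

The three rows score 1, 1/2 and 1/3 respectively, so the mean is about 0.61.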

### 2. The steps taken to improve the matching accuracy in your modelling approach
- From the pairs dataset, we selected only a few attributes and dropped those that were of no use. This should increase the accuracy because there are fewer missing values.

### Elaborate:
### 1. The experience in participating in a Kaggle challenge
- Participating in this Kaggle challenge was a very interesting experience. On the competition page, many notebooks are posted by other data scientists and users, which helps participants come up with ideas to solve the problem. The team has seen how many other data scientists develop their machine learning models.

### 2. The discussion and submission score on Kaggle (include the screenshot or link to your submission here)

### 3. The improvements that need to be done in order to win the challenge
- We should increase our knowledge by performing more data science tasks, which would give us more experience with real-life data science problems. Our model would also need to achieve a high matching accuracy to win this challenge.

# Team contribution


## Participation Percentage:

## (1) Justin Liu Shan Wei (50%)
Tasks: 
- Introduction
- Data collection and understanding
- Exploratory Data Analysis
- Model Building
- Discussion

## (2) Lim Zong Xin (50%)
Tasks: 
- Model building
- Model evaluation
- Model validation
- Discussion