# Data Mining project: Discover and describe areas of interest and events from geo-located data

## 1. Import Dataset and Libraries

In [57]:
# load pandas to deal with the data
import pandas as pd
# plotting
import matplotlib.pyplot as plt
import seaborn as sns

In [77]:
# load data from a CSV file where entries are separated by commas
data = pd.read_table("flickr_data2.csv", sep=",", low_memory=False)

# strip stray whitespace around the column names
data.columns = data.columns.str.strip()

print(data.columns)
print(data.info())
print(data.describe())
data.head()

## 2. Perform Exploratory Data Analysis
First, we will explore the most common **data quality issues**:
* missing values
* duplicates

Second, we will use [**descriptive statistics**](#desc-stats) to get a statistical summary of the data.

We will then use [**data visualisation**](#data-vis) to get a better understanding of the data.

### Missing Values

To check the missing values, several approaches can be used:

1. The `info()` method provides a summary of a dataframe: the type of each column, the number of non-null values and the memory usage. Comparing the non-null count of each column with the total number of entries shows how many values are missing.
2. Using the [`isna()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isna.html) method. By summing the resulting values, we obtain the number of null values for each column.
3. To get the rows with any missing values, you can use `isna()` followed by `any(axis=1)`.
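
The three approaches above can be sketched on a small toy dataframe. The values below are hypothetical; only the column names `id`, `lat` and `long` mirror the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the dataset
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "lat": [45.76, np.nan, 45.77, 45.75],
    "long": [4.83, 4.84, np.nan, 4.85],
})

# 1. info() prints the non-null count of each column next to the total number of entries
toy.info()

# 2. isna().sum() counts the missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# 3. isna().any(axis=1) flags the rows that contain at least one missing value
rows_with_missing = toy[toy.isna().any(axis=1)]
print(rows_with_missing)
```

On the real data, the same calls would be applied to `data` instead of `toy`.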

In [100]:
print(f"Initial: {len(data)}")
# remove rows with missing values on the columns id, lat, and long
data_cleaned_missing_values = data.dropna(subset=['id', 'lat', 'long'])
print(f"After removing missing values: {len(data_cleaned_missing_values)}")

Initial: 420240
After removing missing values: 420240


### Removing duplicates

In [104]:
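print(f"Initial: {len(data_cleaned_missing_values)}")
# number of rows that are exact duplicates across all columns
print(data_cleaned_missing_values.duplicated().sum())
# remove rows sharing the same id, lat and long, keeping the first occurrence
data_cleaned_duplicates = data_cleaned_missing_values.drop_duplicates(subset=['id', 'lat', 'long'], keep='first')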
# show the stats
print(f"After removing duplicates: {len(data_cleaned_duplicates)}")

Initial: 420240
252133
After removing duplicates: 168097
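
Note that 420240 - 252133 = 168107, while only 168097 rows remain. A likely explanation is that `duplicated()` without arguments compares all columns, whereas `drop_duplicates(subset=...)` compares only `id`, `lat` and `long`, so it can remove a few more rows. The sketch below shows the difference on a toy dataframe with hypothetical values.

```python
import pandas as pd

# Toy frame with hypothetical values: rows 0-2 share id/lat/long,
# but row 2 differs in the title column
toy = pd.DataFrame({
    "id": [1, 1, 1, 2],
    "lat": [45.76, 45.76, 45.76, 45.77],
    "long": [4.83, 4.83, 4.83, 4.84],
    "title": ["a", "a", "b", "c"],
})

# Exact duplicates: every column must match an earlier row
full_row_duplicates = toy.duplicated().sum()

# Subset duplicates: only id, lat and long must match an earlier row
subset_duplicates = toy.duplicated(subset=["id", "lat", "long"]).sum()

# Dropping on the subset keeps the first row of each (id, lat, long) group
deduplicated = toy.drop_duplicates(subset=["id", "lat", "long"], keep="first")

print(full_row_duplicates, subset_duplicates, len(deduplicated))
```

Counting with the same `subset` that is passed to `drop_duplicates` makes the printed number match the number of rows actually removed.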


### Descriptive Statistics

To obtain the statistical summary of the dataframe, we can use [`describe()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html). For each column, it displays the count, the mean, the standard deviation, the min and max values, and the percentiles.
By default, for DataFrames with mixed data types, it displays the values for quantitative data only:

In [105]:
data_cleaned_duplicates.describe()

| | id | lat | long | date_taken_minute | date_taken_hour | date_taken_day | date_taken_month | date_taken_year | date_upload_day | date_upload_month | date_upload_year | Unnamed: 16 | Unnamed: 17 | Unnamed: 18 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 168097.0 | 168097.0 | 168097.0 | 168097.0 | 168097.0 | 168097.0 | 168097.0 | 168097.0 | 168096.0 | 168097.0 | 168096.0 | 47.0 | 0.0 | 1.0 |
| mean | 19708670000.0 | 45.768488 | 4.839516 | 29.932592 | 14.771441 | 15.05498 | 7.095724 | 2013.383445 | 15.519245 | 6.849438 | 2013.734437 | 1929.893617 | NaN | 2012.0 |
| std | 13753330000.0 | 0.028839 | 0.031621 | 36.871581 | 6.93436 | 9.971281 | 5.953721 | 34.03619 | 8.484994 | 6.00787 | 33.675077 | 407.214783 | NaN | NaN |
| min | 306667500.0 | 45.6552 | 4.720312 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 12.0 | NaN | 2012.0 |
| 25% | 7421252000.0 | 45.757613 | 4.826195 | 14.0 | 12.0 | 8.0 | 4.0 | 2012.0 | 8.0 | 4.0 | 2012.0 | 2013.0 | NaN | 2012.0 |
| 50% | 15367100000.0 | 45.763275 | 4.832174 | 30.0 | 15.0 | 14.0 | 7.0 | 2014.0 | 15.0 | 7.0 | 2014.0 | 2015.0 | NaN | 2012.0 |
| 75% | 31339390000.0 | 45.773811 | 4.846494 | 45.0 | 18.0 | 23.0 | 10.0 | 2017.0 | 23.0 | 10.0 | 2017.0 | 2016.0 | NaN | 2012.0 |
| max | 49148090000.0 | 45.85495 | 5.006709 | 2019.0 | 2013.0 | 2013.0 | 2011.0 | 2238.0 | 31.0 | 2011.0 | 2019.0 | 2019.0 | NaN | 2012.0 |
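
To also summarise the non-numeric columns, `describe()` accepts `include="all"`. The sketch below uses a toy dataframe with hypothetical values; the `user` strings are placeholders, not real identifiers.

```python
import pandas as pd

# Toy frame with hypothetical values mixing numeric and string columns
toy = pd.DataFrame({
    "lat": [45.76, 45.77, 45.75],
    "long": [4.83, 4.84, 4.85],
    "user": ["u1", "u2", "u1"],
})

# Default behaviour: only the numeric columns are summarised
numeric_summary = toy.describe()
print(list(numeric_summary.columns))

# include="all" adds the other columns, described by count, unique, top and freq
full_summary = toy.describe(include="all")
print(full_summary.loc[["count", "unique", "top", "freq"], "user"])
```

For string columns, the numeric statistics (mean, std, percentiles) are reported as `NaN`, and vice versa.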


## 3. Prepare data for clustering

First, we drop the columns `user`, `tags` and `title` because they are not needed for geographic clustering.

In [114]:
# drop the three columns in a single call; repeating the drop on
# data_cleaned_duplicates would keep only the effect of the last call
df_clustering = data_cleaned_duplicates.drop(columns=['user', 'tags', 'title'])

df_clustering.head()

Let's apply a [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html). Recall that, for a given value `x`, the standard score is given by $z = \frac{x - mean(\mathbf{x})}{std(\mathbf{x})}$.

In [115]:
# scaler
from sklearn.preprocessing import StandardScaler

In [117]:
# StandardScaler only accepts numeric input. The original call, on a frame that
# still held the user column, failed with:
# ValueError: could not convert string to float: '30624617@N03'
# so we keep the numeric columns only
numeric_df = df_clustering.select_dtypes(include='number')
# Scale the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(numeric_df)
# show
print(scaled_data)
# create a DataFrame
scaled_data_df = pd.DataFrame(data=scaled_data, columns=numeric_df.columns)
scaled_data_df.head()
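
The standard score formula can be checked by hand against `StandardScaler`. The sketch below uses a toy dataframe with hypothetical coordinates. One detail worth knowing: `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas' `std()` defaults to the sample standard deviation (`ddof=1`).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame with hypothetical coordinates
toy = pd.DataFrame({
    "lat": [45.75, 45.76, 45.77, 45.78],
    "long": [4.82, 4.83, 4.85, 4.86],
})

# Scaling with scikit-learn
scaled = StandardScaler().fit_transform(toy)

# Manual z-score: subtract the column mean, divide by the population std (ddof=0)
manual = (toy - toy.mean()) / toy.std(ddof=0)

# Both approaches agree up to floating-point error
print(np.allclose(scaled, manual.to_numpy()))
```

After scaling, every column has mean 0 and standard deviation 1, so no single feature dominates a distance-based clustering algorithm merely because of its units.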